pith. machine review for the scientific record.

arxiv: 2412.01506 · v3 · submitted 2024-12-02 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Structured 3D Latents for Scalable and Versatile 3D Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 15:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D generation · structured latents · SLAT · rectified flow transformers · multiview features · text-to-3D · image-to-3D · versatile 3D assets

The pith

A structured latent that merges sparse 3D grids with dense multiview features supports high-quality generation of 3D assets in multiple output formats from text or image input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a unified Structured LATent (SLAT) representation that encodes both geometry and appearance by combining a sparsely populated 3D grid with dense visual features from a vision foundation model. This single latent space can be decoded into radiance fields, 3D Gaussians, or meshes without retraining the generator. Rectified-flow transformers with up to two billion parameters, trained on 500K objects, produce results that exceed prior methods under text or image conditioning. The design also enables local 3D editing, a capability not available in earlier single-format generators.

Core claim

Integrating a sparsely populated 3D grid with dense multiview visual features from a vision foundation model produces a flexible latent representation that captures both structural geometry and textural appearance. A single generator trained on this latent can decode into radiance fields, 3D Gaussians, or meshes, and scales effectively to models of up to two billion parameters.

What carries the argument

The Structured LATent (SLAT) representation, formed by fusing a sparse 3D grid with dense multiview features extracted from a vision foundation model to jointly encode geometry and appearance while preserving decoding flexibility.
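As one concrete reading of that fusion, a minimal sketch of a SLAT-like container follows: a list of active voxel coordinates paired with features mean-pooled from the views that see each voxel. The names, shapes, and pooling rule are illustrative assumptions, not the paper's implementation.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class SLAT:
        coords: np.ndarray   # (N, 3) integer indices of active voxels in an L^3 grid
        feats: np.ndarray    # (N, C) dense features pooled from multiview images
        grid_size: int       # L, the resolution of the sparse grid

    def build_slat(voxel_coords, multiview_feats, visibility, grid_size=64):
        """Pool per-view foundation-model features onto each active voxel.

        multiview_feats: (V, N, C) features projected from V views onto N voxels.
        visibility:      (V, N) array, 1.0 where a voxel is visible in a view.
        """
        w = visibility[..., None]                                # (V, N, 1)
        pooled = (multiview_feats * w).sum(axis=0) / np.clip(w.sum(axis=0), 1e-6, None)
        return SLAT(coords=voxel_coords, feats=pooled, grid_size=grid_size)

The sparse coordinate list carries the geometry, the pooled feature vectors carry the appearance, and downstream decoders read both.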

If this is right

  • A single generator can output radiance fields, 3D Gaussians, or meshes on demand after training (sketched after this list).
  • Local editing of generated 3D assets becomes feasible without retraining the model.
  • Model scale up to two billion parameters remains stable when trained on a 500K-object dataset.
  • Performance surpasses prior methods at similar model sizes under both text and image conditioning.
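The format-flexibility bullet above, sketched minimally: one shared latent feeds several lightweight decoder heads, so a new output format means a new head rather than a new generator. The head names and output dimensions are illustrative assumptions, not the paper's decoders.

    import torch
    import torch.nn as nn

    class MultiFormatDecoder(nn.Module):
        """One SLAT feature tensor in, one of several 3D formats out."""
        def __init__(self, latent_dim):
            super().__init__()
            self.heads = nn.ModuleDict({
                "radiance":  nn.Linear(latent_dim, 4),   # e.g. density + RGB per voxel
                "gaussians": nn.Linear(latent_dim, 14),  # e.g. mean, scale, rotation, opacity, color
                "mesh":      nn.Linear(latent_dim, 1),   # e.g. signed distance for isosurfacing
            })

        def forward(self, slat_feats, fmt):
            # slat_feats: (N, latent_dim) features on active voxels
            return self.heads[fmt](slat_feats)

In practice each real decoder would be far heavier than a linear layer; the point is only that the generator upstream of the heads is shared.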

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The separation of sparse structural encoding from dense appearance features may simplify downstream tasks such as material editing or animation transfer.
  • Because the latent supports multiple decoders, the same generator could be paired with new output representations developed after training.
  • Scaling laws observed for the rectified-flow transformers on this latent may guide further increases in model size for even higher fidelity.

Load-bearing premise

Combining the sparse 3D grid with dense multiview features from a foundation model is sufficient to capture both geometry and appearance without restricting the range of possible output formats.

What would settle it

A controlled benchmark in which the SLAT model produces visibly lower-quality or less consistent 3D assets than recent comparable-scale baselines when evaluated on identical text-to-3D and image-to-3D prompts.

Original abstract

We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.
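The "rectified flow transformers" in the abstract train with a flow-matching objective along straight noise-to-data paths. A minimal sketch of that training step over latent feature tensors, assuming a generic model(x, t, cond) signature rather than the paper's architecture:

    import torch
    import torch.nn.functional as F

    def rectified_flow_loss(model, x1, cond):
        """x1: clean latent features (B, N, C); cond: a text or image embedding."""
        x0 = torch.randn_like(x1)                     # noise endpoint of the path
        t = torch.rand(x1.shape[0], 1, 1, device=x1.device)
        xt = (1.0 - t) * x0 + t * x1                  # straight-line interpolant
        v_target = x1 - x0                            # constant velocity along the path
        v_pred = model(xt, t.squeeze(), cond)         # network predicts the velocity
        return F.mse_loss(v_pred, v_target)

Sampling then integrates the learned velocity from noise to data, for example with a handful of Euler steps.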

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Structured LATent (SLAT), a unified 3D representation formed by integrating a sparsely-populated 3D grid with dense multiview visual features from a vision foundation model. This latent supports decoding into multiple output formats including Radiance Fields, 3D Gaussians, and meshes. The authors train rectified flow transformers (up to 2B parameters) on a 500K-object dataset and claim that the resulting models produce high-quality 3D assets from text or image conditions, significantly outperforming prior methods at comparable scales while also enabling flexible format selection and local editing.

Significance. If the empirical claims are substantiated, the work would represent a meaningful advance in scalable 3D generation by offering a single latent that preserves both geometry and appearance while supporting multiple downstream decoders. The scale of training (2B parameters on 500K assets) and the promised public release of code, models, and data would further increase its utility for the community.

major comments (2)
  1. [Abstract] The central claim that the model 'significantly surpass[es] existing methods, including recent ones at similar scales' is presented without any quantitative metrics (e.g., FID, PSNR, Chamfer distance; the standard Chamfer definition is recalled after this report), ablation tables, or comparative results. This absence makes the superiority assertion impossible to evaluate, even though the claim is load-bearing for the paper's main contribution.
  2. [Abstract] The assertion that the SLAT representation 'comprehensively captur[es] both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding' lacks any supporting analysis, reconstruction-error bounds, or ablation showing that the sparse-grid + multiview-feature combination is information-complete for all three target decoders. If the sparse grid under-samples fine geometry or the vision features misalign with 3D structure, the claimed decoding versatility cannot hold.
minor comments (1)
  1. The acronym SLAT is defined on first use, but subsequent references to 'SLAT' would benefit from a brief reminder of its components when the architecture is first detailed.
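For reference, the geometry metric named in major comment 1 is conventionally the symmetric Chamfer distance between point sets P and Q sampled from the two surfaces; the paper may use a squared or single-directional variant, so take this as the generic form:

    d_{\mathrm{CD}}(P, Q) = \frac{1}{|P|} \sum_{p \in P} \min_{q \in Q} \lVert p - q \rVert_2^2
                          + \frac{1}{|Q|} \sum_{q \in Q} \min_{p \in P} \lVert q - p \rVert_2^2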

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting areas where the abstract could more clearly substantiate our claims. We address each major comment below with references to the full manuscript and commit to targeted revisions.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the model 'significantly surpass[es] existing methods, including recent ones at similar scales' is presented without any quantitative metrics (e.g., FID, PSNR, Chamfer distance), ablation tables, or comparative results. This absence makes the superiority assertion impossible to evaluate, even though the claim is load-bearing for the paper's main contribution.

    Authors: We agree the abstract would be stronger with explicit metrics. Section 4 of the manuscript reports quantitative comparisons on standard benchmarks, including FID for generation quality, PSNR for novel-view rendering, and Chamfer distance for geometry accuracy, showing consistent gains over prior methods at comparable model scales (e.g., 1-2B parameters). We will revise the abstract to include concise references to these key metrics and point readers to the corresponding tables and figures. revision: yes

  2. Referee: [Abstract] The assertion that the SLAT representation 'comprehensively captur[es] both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding' lacks any supporting analysis, reconstruction-error bounds, or ablation showing that the sparse-grid + multiview-feature combination is information-complete for all three target decoders. If the sparse grid under-samples fine geometry or the vision features misalign with 3D structure, the claimed decoding versatility cannot hold.

    Authors: Sections 3.2 and 5.1 present reconstruction experiments and ablations that quantify geometry and appearance fidelity when decoding SLAT to Radiance Fields, 3D Gaussians, and meshes. These include per-decoder error metrics and ablation studies on grid sparsity and multiview feature alignment, demonstrating that the hybrid representation preserves the necessary information for high-quality outputs across formats. Formal information-theoretic bounds are not derived, as they are intractable for this learned hybrid latent; we will add a brief discussion of this point and of potential edge cases in sparse sampling. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical representation and training on external data

Full rationale

The paper introduces SLAT as a novel latent representation defined by the explicit integration of a sparse 3D grid and dense multiview features from an external vision foundation model. This is presented as an architectural choice trained end-to-end on a 500K-object dataset, with performance claims resting on empirical results rather than on any derivation that reduces to fitted parameters, self-referential definitions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked that collapse the central claims back to the inputs by construction. The approach is validated against external benchmarks and does not rely on load-bearing self-citations for its core premise.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The abstract-only view provides no explicit free parameters or background axioms; the primary addition is the SLAT representation itself.

invented entities (1)
  • Structured LATent (SLAT) · no independent evidence
    purpose: Unified representation combining sparse 3D grid and dense multiview features for decoding to multiple 3D formats
    Described as the cornerstone of the method that captures structural and textural information while allowing flexible output.

pith-pipeline@v0.9.0 · 5495 in / 1223 out tokens · 43177 ms · 2026-05-16T15:05:40.004428+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing alexander_duality_circle_linking · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats... by integrating a sparsely-populated 3D grid with dense multiview visual features... comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

    cs.CV 2026-04 unverdicted novelty 7.0

    A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.

  2. Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch

    cs.CV 2026-04 unverdicted novelty 7.0

    A conditional diffusion model using proprioception and multi-contact touch produces metric-scale, physically consistent 3D object reconstructions under hand occlusion.

  3. SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.

  4. ATATA: One Algorithm to Align Them All

    cs.CV 2026-01 unverdicted novelty 7.0

    ATATA enables fast joint inference of structurally aligned pairs using Rectified Flow models via segment transport, improving state-of-the-art for image and video generation while matching 3D quality at much higher speed.

  5. Affostruction: 3D Affordance Grounding with Generative Reconstruction

    cs.CV 2026-01 unverdicted novelty 7.0

    Affostruction reconstructs full 3D object geometry from partial RGBD views and grounds text-based affordances on both visible and unobserved surfaces, reporting large gains over prior methods.

  6. PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World

    cs.CV 2026-05 unverdicted novelty 6.0

    PhysForge generates physics-grounded 3D assets via a VLM-planned Hierarchical Physical Blueprint and a KineVoxel Injection diffusion model, backed by the new PhysDB dataset of 150,000 annotated assets.

  7. Velox: Learning Representations of 4D Geometry and Appearance

    cs.CV 2026-05 unverdicted novelty 6.0

    Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...

  8. 3D-ReGen: A Unified 3D Geometry Regeneration Framework

    cs.CV 2026-04 unverdicted novelty 6.0

    3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.

  9. REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement

    cs.CV 2026-04 unverdicted novelty 6.0

    REVIVE 3D generates voluminous 3D assets from flat 2D images via an inflated prior construction followed by latent-space refinement, plus new metrics for volume and flatness validated by user study.

  10. Pair2Scene: Learning Local Object Relations for Procedural Scene Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Pair2Scene generates complex 3D scenes beyond training data by training a network on local object-pair placement rules and applying them recursively with collision-aware sampling.

  11. WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...

  12. ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

    cs.CV 2026-04 unverdicted novelty 6.0

    ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.

  13. UniRecGen: Unifying Multi-View 3D Reconstruction and Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    UniRecGen unifies reconstruction and generation via shared canonical space and disentangled cooperative learning to produce complete, consistent 3D models from sparse views.

  14. MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation

    cs.CV 2026-03 unverdicted novelty 6.0

    MV-SAM3D adds multi-view fusion via multi-diffusion with attention-entropy and visibility weighting plus physics-aware optimization to improve fidelity and physical plausibility in layout-aware 3D generation.

  15. Depth Anything 3: Recovering the Visual Space from Any Views

    cs.CV 2025-11 unverdicted novelty 6.0

    DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.

  16. Syn4D: A Multiview Synthetic 4D Dataset

    cs.CV 2026-05 unverdicted novelty 5.0

    Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.

  17. Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples spa...

  18. Hunyuan3D 2.5: Towards High-Fidelity 3D Assets Generation with Ultimate Details

    cs.CV 2025-06 unverdicted novelty 4.0

    Hunyuan3D 2.5's LATTICE model with 10B parameters generates detailed 3D shapes from images and uses multi-view PBR for textures, outperforming prior methods in fidelity and mesh quality.

Reference graph

Works this paper leans on

117 extracted references · 117 canonical work pages · cited by 18 Pith papers · 6 internal anchors
