Recognition: 2 theorem links
TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models
Pith reviewed 2026-05-16 21:45 UTC · model grok-4.3
The pith
TripoSG generates high-fidelity 3D meshes from images via a large-scale rectified flow transformer trained on two million samples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TripoSG achieves state-of-the-art 3D shape generation with a large-scale rectified flow transformer trained on extensive high-quality data, paired with a hybrid supervised training strategy for the 3D VAE that combines SDF, normal, and eikonal losses. The resulting meshes show enhanced detail, precise fidelity to input images, and strong generalization across diverse image styles and contents.
What carries the argument
Large-scale rectified flow transformer that directly models the mapping from noise to 3D shape representations conditioned on images.
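The rectified-flow objective behind this noise-to-shape mapping can be sketched in a few lines. The following is a minimal NumPy illustration, not TripoSG's implementation; the placeholder predictor, latent size, and condition size are all made up. The model is trained to predict the constant velocity x1 − x0 along the straight path x_t = (1 − t)·x0 + t·x1 between Gaussian noise x0 and a shape latent x1, conditioned on an image embedding:

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_loss(predict_velocity, x1, cond):
    """x1: batch of shape latents; cond: batch of image-condition embeddings."""
    x0 = rng.standard_normal(x1.shape)          # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))      # uniform time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1               # straight-line interpolant
    target = x1 - x0                            # constant velocity along the path
    pred = predict_velocity(x_t, t, cond)
    return np.mean((pred - target) ** 2)        # velocity-matching MSE

# Trivial stand-in "network" that always predicts zero velocity; in the paper
# this role is played by the large image-conditioned transformer.
zero_net = lambda x_t, t, cond: np.zeros_like(x_t)

x1 = rng.standard_normal((8, 64))     # pretend shape latents
cond = rng.standard_normal((8, 32))   # pretend image embeddings
loss = rectified_flow_loss(zero_net, x1, cond)
```

At sampling time the learned velocity field is integrated from t = 0 to t = 1 (e.g. with a few Euler steps); the near-straight trajectories rectified flow induces are what make few-step generation viable at this scale.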
If this is right
- High-resolution 3D meshes can be produced directly from single images with greater surface detail than earlier diffusion approaches.
- The model maintains alignment with input images even when those images vary widely in style and content.
- Data scale and processing rules become central determinants of quality in 3D generative training.
- Public release of the trained model enables direct use and further extension by the community.
- Hybrid losses on SDF, normals, and eikonal terms improve the underlying 3D VAE reconstruction quality.
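The hybrid VAE supervision in the last bullet is a weighted sum of three terms. A minimal sketch, assuming unit-length normals and illustrative weights (the paper's actual weights and point-sampling scheme are not reproduced here):

```python
import numpy as np

def hybrid_vae_loss(pred_sdf, true_sdf, pred_normals, true_normals, grad_sdf,
                    w_sdf=1.0, w_normal=0.5, w_eik=0.1):
    """Combine the three supervision terms; the weights are illustrative.
    - SDF term: regress the signed distance at sampled points.
    - Normal term: align predicted surface normals with ground truth (unit vectors).
    - Eikonal term: push |grad SDF| toward 1, as a valid distance field requires.
    """
    l_sdf = np.mean((pred_sdf - true_sdf) ** 2)
    l_normal = np.mean(1.0 - np.sum(pred_normals * true_normals, axis=-1))
    l_eik = np.mean((np.linalg.norm(grad_sdf, axis=-1) - 1.0) ** 2)
    return w_sdf * l_sdf + w_normal * l_normal + w_eik * l_eik

rng = np.random.default_rng(0)
sdf = rng.standard_normal(100)                               # sampled signed distances
normals = rng.standard_normal((100, 3))
normals /= np.linalg.norm(normals, axis=-1, keepdims=True)   # unit normals

# A perfect decoder with unit-norm SDF gradients drives all three terms to zero.
perfect = hybrid_vae_loss(sdf, sdf, normals, normals, normals)
# Any mismatch in distances, normals, or gradient norms raises the loss.
imperfect = hybrid_vae_loss(sdf + 0.5, sdf, normals, -normals, 2.0 * normals)
```

The eikonal term is what distinguishes this from plain SDF regression: it regularizes the field to behave like a true distance function even away from supervised samples, which is what the review credits for the improved reconstruction quality.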
Where Pith is reading between the lines
- The method could support downstream tasks such as texture mapping or animation once the mesh output is further processed.
- Scaling the same rectified-flow architecture to dynamic or multi-view 3D data might extend the fidelity gains to video or scene generation.
- Integration with existing image-generation pipelines could allow end-to-end creation of textured 3D assets from text prompts alone.
- If the data pipeline generalizes, similar scaling strategies may accelerate progress in other data-scarce 3D domains such as medical imaging or robotics.
Load-bearing premise
The custom data processing pipeline yields two million sufficiently high-quality, diverse, and unbiased 3D training samples that support the claimed fidelity and generalization in real-world use.
What would settle it
Generated meshes from a held-out set of real-world photographs show systematic mismatches in fine surface details or fail to preserve image-specific features at the claimed resolution.
read the original abstract
Recent advancements in diffusion techniques have propelled image and video generation to unprecedented levels of quality, significantly accelerating the deployment and application of generative AI. However, 3D shape generation technology has so far lagged behind, constrained by limitations in 3D data scale, complexity of 3D data processing, and insufficient exploration of advanced techniques in the 3D domain. Current approaches to 3D shape generation face substantial challenges in terms of output quality, generalization capability, and alignment with input conditions. We present TripoSG, a new streamlined shape diffusion paradigm capable of generating high-fidelity 3D meshes with precise correspondence to input images. Specifically, we propose: 1) A large-scale rectified flow transformer for 3D shape generation, achieving state-of-the-art fidelity through training on extensive, high-quality data. 2) A hybrid supervised training strategy combining SDF, normal, and eikonal losses for 3D VAE, achieving high-quality 3D reconstruction performance. 3) A data processing pipeline to generate 2 million high-quality 3D samples, highlighting the crucial rules for data quality and quantity in training 3D generative models. Through comprehensive experiments, we have validated the effectiveness of each component in our new framework. The seamless integration of these parts has enabled TripoSG to achieve state-of-the-art performance in 3D shape generation. The resulting 3D shapes exhibit enhanced detail due to high-resolution capabilities and demonstrate exceptional fidelity to input images. Moreover, TripoSG demonstrates improved versatility in generating 3D models from diverse image styles and contents, showcasing strong generalization capabilities. To foster progress and innovation in the field of 3D generation, we will make our model publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TripoSG, a streamlined 3D shape diffusion framework that uses a large-scale rectified flow transformer trained on 2 million samples, a hybrid VAE with combined SDF/normal/eikonal losses, and a custom data processing pipeline. It claims state-of-the-art performance in high-fidelity mesh generation from images, with improved detail, input fidelity, and generalization across styles and contents, validated through comprehensive experiments and ablations.
Significance. If the empirical claims hold with proper validation, this work would advance 3D generative modeling by scaling rectified flow techniques to large data regimes and highlighting data quality rules, potentially improving applications in graphics, AR/VR, and content creation where current methods lag in fidelity and generalization.
major comments (2)
- [Data Processing Pipeline] Data Processing Pipeline section: the claim that the pipeline yields 2 million high-quality, representative 3D samples enabling SOTA performance lacks quantitative support such as diversity statistics versus Objaverse, failure rates on edge cases, or artifact/bias metrics; without these, it is unclear whether reported gains stem from the rectified flow transformer or from data curation artifacts.
- [Experiments] Experiments section: the abstract and framework description assert SOTA results from comprehensive experiments and component ablations, yet no specific quantitative metrics, baseline comparisons, error bars, or tabled results are referenced to substantiate the fidelity and generalization claims, leaving the central performance assertion without verifiable grounding.
minor comments (1)
- [Abstract] Abstract: key numerical results (e.g., FID, CD, or user study scores) should be included to allow readers to immediately assess the SOTA claim.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and detailed review of our manuscript. We address each major comment point by point below and will revise the paper to incorporate additional quantitative details where appropriate.
read point-by-point responses
-
Referee: [Data Processing Pipeline] Data Processing Pipeline section: the claim that the pipeline yields 2 million high-quality, representative 3D samples enabling SOTA performance lacks quantitative support such as diversity statistics versus Objaverse, failure rates on edge cases, or artifact/bias metrics; without these, it is unclear whether reported gains stem from the rectified flow transformer or from data curation artifacts.
Authors: We agree that additional quantitative validation of the data processing pipeline would strengthen the manuscript. In the revised version, we will expand the Data Processing Pipeline section to include diversity statistics (such as category distributions and geometric complexity metrics) compared against Objaverse, reported failure rates from the filtering stages, and a brief analysis of potential artifacts or biases. These additions will help clarify the contribution of the curated dataset relative to the model architecture. revision: yes
-
Referee: [Experiments] Experiments section: the abstract and framework description assert SOTA results from comprehensive experiments and component ablations, yet no specific quantitative metrics, baseline comparisons, error bars, or tabled results are referenced to substantiate the fidelity and generalization claims, leaving the central performance assertion without verifiable grounding.
Authors: We acknowledge that the abstract and framework overview would benefit from more explicit cross-references to the quantitative results. The experiments section already presents detailed metrics (including fidelity measures, baseline comparisons, and ablation studies), but we will revise the abstract to highlight key quantitative outcomes and add direct references to the relevant tables and figures in the framework description. Error bars will be added to applicable plots in the revision. revision: yes
Circularity Check
No significant circularity; claims rest on empirical training and a new data pipeline.
full rationale
The paper presents TripoSG as a new model trained on 2 million samples from a custom data pipeline, using a rectified flow transformer and hybrid VAE losses (SDF + normal + eikonal). All central claims (SOTA fidelity, generalization) are supported by experimental validation on held-out data rather than any derivation that reduces to fitted inputs, self-definitions, or self-citation chains. The data pipeline is described as an input rather than derived from the model's outputs, and no equations or uniqueness theorems are invoked that loop back to the paper's own results. This is a standard empirical ML paper with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- Training dataset scale
axioms (1)
- domain assumption: Rectified flow models transfer effectively to 3D shape generation when scaled with sufficient high-quality data
Forward citations
Cited by 19 Pith papers
-
On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...
-
Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.
-
Make-It-Poseable: Feed-forward Latent Posing Model for 3D Characters
A latent-space transformer framework poses 3D characters without skinning or fixed topologies, outperforming baselines and generalizing zero-shot to quadrupeds.
-
Pixal3D: Pixel-Aligned 3D Generation from Images
Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.
-
DVD: Discrete Voxel Diffusion for 3D Generation and Editing
DVD treats voxel occupancy as a discrete variable in a diffusion framework to generate, assess, and edit sparse 3D voxels without continuous thresholding.
-
PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World
PhysForge generates physics-grounded 3D assets via a VLM-planned Hierarchical Physical Blueprint and a KineVoxel Injection diffusion model, backed by the new PhysDB dataset of 150,000 annotated assets.
-
Velox: Learning Representations of 4D Geometry and Appearance
Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...
-
3D-ReGen: A Unified 3D Geometry Regeneration Framework
3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.
-
Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data
BVE framework enables text-guided 3D editing beyond voxel limits by combining self-constructed data, lightweight semantic injection, and annotation-free masking to preserve local invariance.
-
ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment
ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.
-
UniRecGen: Unifying Multi-View 3D Reconstruction and Generation
UniRecGen unifies reconstruction and generation via shared canonical space and disentangled cooperative learning to produce complete, consistent 3D models from sparse views.
-
SegviGen: Repurposing 3D Generative Model for Part Segmentation
SegviGen shows pretrained 3D generative models can be repurposed for part segmentation via voxel colorization, beating prior methods by 40% interactively and 15% on full segmentation using only 0.32% of labeled data.
-
MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation
MV-SAM3D adds multi-view fusion via multi-diffusion with attention-entropy and visibility weighting plus physics-aware optimization to improve fidelity and physical plausibility in layout-aware 3D generation.
-
Learn2Fold: Structured Origami Generation with World Model Planning
Learn2Fold generates physically valid origami folding sequences from text prompts by decoupling LLM-based program proposals from verification in a learned graph-structured world model.
-
Pose-Aware Diffusion for 3D Generation
PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.
-
From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation
The paper surveys 3D asset generation methods and organizes them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.
-
AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation
AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantica...
-
From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation
The paper surveys 3D content generation literature using a taxonomy of asset types and production stages to evaluate progress toward engine-ready assets.
-
Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation
Seed3D 2.0 advances 3D content generation via a coarse-to-fine geometry pipeline, unified PBR material model, and simulation-ready scene tools, reporting 69-89.9% win rates over commercial systems in human studies.
Reference graph
Works this paper leans on
-
[1]
Frozen in time: A joint video and image encoder for end-to-end retrieval
Bain, M., Nagrani, A., Varol, G., and Zisserman, A. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 1728--1738, 2021
work page 2021
-
[2]
All are worth words: A vit backbone for diffusion models
Bao, F., Nie, S., Xue, K., Cao, Y., Li, C., Su, H., and Zhu, J. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 22669--22679, 2023
work page 2023
-
[4]
blackforestlabs. Flux. https://github.com/black-forest-labs/flux, 2024
work page 2024
-
[5]
Video generation models as world simulators
Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., and Ramesh, A. Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators
work page 2024
-
[6]
Chan, E. R., Nagano, K., Chan, M. A., Bergman, A. W., Park, J. J., Levy, A., Aittala, M., Mello, S. D., Karras, T., and Wetzstein, G. Generative novel view synthesis with 3d-aware diffusion models. In IEEE/CVF ICCV , pp.\ 4194--4206, 2023
work page 2023
-
[8]
Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation
Chen, R., Chen, Y., Jiao, N., and Jia, K. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In IEEE/CVF ICCV , pp.\ 22189--22199, 2023
work page 2023
-
[9]
Cheng, Y., Lee, H., Tulyakov, S., Schwing, A. G., and Gui, L. Sdfusion: Multimodal 3d shape completion, reconstruction, and generation. In IEEE/CVF CVPR , pp.\ 4456--4465, 2023
work page 2023
-
[10]
Scaling vision transformers to 22 billion parameters
Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., Steiner, A. P., Caron, M., Geirhos, R., Alabdulmohsin, I., et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pp.\ 7480--7512. PMLR, 2023
work page 2023
-
[11]
Objaverse: A universe of annotated 3d objects
Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., and Farhadi, A. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 13142--13153, 2023
work page 2023
-
[13]
Imagenet: A large-scale hierarchical image database
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp.\ 248--255. Ieee, 2009
work page 2009
-
[14]
Scaling rectified flow transformers for high-resolution image synthesis
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024
work page 2024
-
[15]
Fan, H., Su, H., and Guibas, L. J. A point set generation network for 3d object reconstruction from a single image. In IEEE/CVF CVPR , pp.\ 2463--2471, 2017
work page 2017
-
[17]
Learning a predictable and generative vector representation for objects
Girdhar, R., Fouhey, D. F., Rodriguez, M., and Gupta, A. Learning a predictable and generative vector representation for objects. In Leibe, B., Matas, J., Sebe, N., and Welling, M. (eds.), ECCV , volume 9910, pp.\ 484--499, 2016
work page 2016
-
[18]
Gans trained by a two time-scale update rule converge to a local nash equilibrium
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017
work page 2017
-
[19]
Denoising diffusion probabilistic models
Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), NeurIPS, 2020
work page 2020
-
[20]
3dtopia: Large text-to-3d generation model with hybrid diffusion priors
Hong, F., Tang, J., Cao, Z., Shi, M., Wu, T., Chen, Z., Wang, T., Pan, L., Lin, D., and Liu, Z. 3dtopia: Large text-to-3d generation model with hybrid diffusion priors. CoRR, abs/2403.02234, 2024
-
[22]
Huang, J., Su, H., and Guibas, L. J. Robust watertight manifold surface generation method for shapenet models. CoRR, abs/1802.01698, 2018
work page 2018
-
[24]
Neural wavelet-domain diffusion for 3d shape generation
Hui, K., Li, R., Hu, J., and Fu, C. Neural wavelet-domain diffusion for 3d shape generation. In Jung, S. K., Lee, J., and Bargteil, A. W. (eds.), SIGGRAPH Asia , pp.\ 24:1--24:9. ACM , 2022
work page 2022
-
[25]
Shap-E: Generating Conditional 3D Implicit Functions
Jun, H. and Nichol, A. Shap-e: Generating conditional 3d implicit functions. CoRR, abs/2305.02463, 2023
work page 2023
-
[26]
Lan, Y., Hong, F., Yang, S., Zhou, S., Meng, X., Dai, B., Pan, X., and Loy, C. C. Ln3diff: Scalable latent neural fields diffusion for speedy 3d generation. In ECCV, 2024
work page 2024
-
[29]
Era3d: High-resolution multiview diffusion using efficient row-wise attention
Li, P., Liu, Y., Long, X., Zhang, F., Lin, C., Li, M., Qi, X., Zhang, S., Luo, W., Tan, P., Wang, W., Liu, Q., and Guo, Y. Era3d: High-resolution multiview diffusion using efficient row-wise attention. CoRR, abs/2405.11616, 2024 b
-
[31]
Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching
Liang, Y., Yang, X., Lin, J., Li, H., Xu, X., and Chen, Y. Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching. In IEEE/CVF CVPR, pp.\ 6517--6526, 2024
work page 2024
-
[32]
Magic3d: High-resolution text-to-3d content creation
Lin, C., Gao, J., Tang, L., Takikawa, T., Zeng, X., Huang, X., Kreis, K., Fidler, S., Liu, M., and Lin, T. Magic3d: High-resolution text-to-3d content creation. In IEEE CVPR , pp.\ 300--309, 2023
work page 2023
-
[33]
Liu, M., Xu, C., Jin, H., Chen, L., T., M. V., Xu, Z., and Su, H. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. In NeurIPS, 2023 a
work page 2023
-
[34]
Meshformer: High-quality mesh generation with 3d-guided reconstruction model
Liu, M., Zeng, C., Wei, X., Shi, R., Chen, L., Xu, C., Zhang, M., Wang, Z., Zhang, X., Liu, I., Wu, H., and Su, H. Meshformer: High-quality mesh generation with 3d-guided reconstruction model. CoRR, abs/2408.10198, 2024 a
-
[35]
Zero-1-to-3: Zero-shot one image to 3d object
Liu, R., Wu, R., Hoorick, B. V., Tokmakov, P., Zakharov, S., and Vondrick, C. Zero-1-to-3: Zero-shot one image to 3d object. In IEEE/CVF ICCV , pp.\ 9264--9275, 2023 b
work page 2023
-
[36]
Flow straight and fast: Learning to generate and transfer data with rectified flow
Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR , 2023 c
work page 2023
-
[37]
Syncdreamer: Generating multiview-consistent images from a single-view image
Liu, Y., Lin, C., Zeng, Z., Long, X., Liu, L., Komura, T., and Wang, W. Syncdreamer: Generating multiview-consistent images from a single-view image. In ICLR , 2024 b
work page 2024
-
[39]
Wonder3d: Single image to 3d using cross-domain diffusion
Long, X., Guo, Y.-C., Lin, C., Liu, Y., Dou, Z., Liu, L., Ma, Y., Zhang, S.-H., Habermann, M., Theobalt, C., et al. Wonder3d: Single image to 3d using cross-domain diffusion. In IEEE/CVF CVPR, pp.\ 9970--9980, 2024
work page 2024
-
[40]
Lorensen, W. E. and Cline, H. E. Marching cubes: A high resolution 3d surface construction algorithm. In Stone, M. C. (ed.), Proceedings of the SIGGRAPH , pp.\ 163--169, 1987
work page 1987
-
[41]
PC\(^2\): Projection-conditioned point cloud diffusion for single-image 3d reconstruction
Melas-Kyriazi, L., Rupprecht, C., and Vedaldi, A. PC\(^2\): Projection-conditioned point cloud diffusion for single-image 3d reconstruction. In IEEE/CVF CVPR, pp.\ 12923--12932, 2023
work page 2023
-
[42]
Occupancy networks: Learning 3d reconstruction in function space
Mescheder, L. M., Oechsle, M., Niemeyer, M., Nowozin, S., and Geiger, A. Occupancy networks: Learning 3d reconstruction in function space. In IEEE/CVF CVPR , pp.\ 4460--4470, 2019
work page 2019
-
[43]
Latent-nerf for shape-guided generation of 3d shapes and textures
Metzer, G., Richardson, E., Patashnik, O., Giryes, R., and Cohen - Or, D. Latent-nerf for shape-guided generation of 3d shapes and textures. In IEEE/CVF CVPR , pp.\ 12663--12673, 2023
work page 2023
-
[44]
Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. In Vedaldi, A., Bischof, H., Brox, T., and Frahm, J. (eds.), ECCV, volume 12346, pp.\ 405--421, 2020
work page 2020
-
[45]
Diffrf: Rendering-guided 3d radiance field diffusion
Müller, N., Siddiqui, Y., Porzi, L., Bulò, S. R., Kontschieder, P., and Nießner, M. Diffrf: Rendering-guided 3d radiance field diffusion. In IEEE/CVF CVPR, pp.\ 4328--4338, 2023
work page 2023
-
[46]
Kinectfusion: Real-time dense surface mapping and tracking
Newcombe, R. A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A. J., Kohli, P., Shotton, J., Hodges, S., and Fitzgibbon, A. W. Kinectfusion: Real-time dense surface mapping and tracking. In IEEE ISMAR , pp.\ 127--136, 2011
work page 2011
-
[47]
Point-E: A System for Generating 3D Point Clouds from Complex Prompts
Nichol, A., Jun, H., Dhariwal, P., Mishkin, P., and Chen, M. Point-e: A system for generating 3d point clouds from complex prompts. CoRR, abs/2212.08751, 2022
work page 2022
-
[48]
OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023 a
work page 2023
-
[49]
OpenAI. Gpt-4v(ision) system card. https://cdn.openai.com/papers/GPTV_System_Card.pdf, 2023 b . Accessed: 2023-09-25
work page 2023
-
[51]
Peebles, W. and Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 4195--4205, 2023
work page 2023
-
[53]
Poole, B., Jain, A., Barron, J. T., and Mildenhall, B. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR , 2023
work page 2023
-
[54]
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PMLR, 2021
work page 2021
-
[55]
Zero-shot text-to-image generation
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. In ICML , volume 139, pp.\ 8821--8831, 2021
work page 2021
-
[56]
Scaling vision with sparse mixture of experts
Riquelme, C., Puigcerver, J., Mustafa, B., Neumann, M., Jenatton, R., Susano Pinto, A., Keysers, D., and Houlsby, N. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34:8583--8595, 2021
work page 2021
-
[57]
High-resolution image synthesis with latent diffusion models
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 10684--10695, 2022
work page 2022
-
[58]
Laion-5b: An open large-scale dataset for training next generation image-text models
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278--25294, 2022
work page 2022
-
[59]
Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model
Shi, R., Chen, H., Zhang, Z., Liu, M., Xu, C., Wei, X., Chen, L., Zeng, C., and Su, H. Zero123++: a single image to consistent multi-view diffusion base model. CoRR, abs/2310.15110, 2023 a
work page 2023
-
[61]
Mvdream: Multi-view diffusion for 3d generation
Shi, Y., Wang, P., Ye, J., Mai, L., Li, K., and Yang, X. Mvdream: Multi-view diffusion for 3d generation. In ICLR , 2024
work page 2024
-
[62]
Shue, J. R., Chan, E. R., Po, R., Ankner, Z., Wu, J., and Wetzstein, G. 3d neural field generation using triplane diffusion. In IEEE/CVF CVPR , pp.\ 20875--20886, 2023
work page 2023
-
[63]
Make-a-video: Text-to-video generation without text-video data
Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., Parikh, D., Gupta, S., and Taigman, Y. Make-a-video: Text-to-video generation without text-video data. In ICLR , 2023
work page 2023
-
[64]
Make-it-3d: High-fidelity 3d creation from A single image with diffusion prior
Tang, J., Wang, T., Zhang, B., Zhang, T., Yi, R., Ma, L., and Chen, D. Make-it-3d: High-fidelity 3d creation from A single image with diffusion prior. In IEEE ICCV , pp.\ 22762--22772, 2023
work page 2023
-
[65]
Dreamgaussian: Generative gaussian splatting for efficient 3d content creation
Tang, J., Ren, J., Zhou, H., Liu, Z., and Zeng, G. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. In ICLR , 2024
work page 2024
-
[66]
The Movie Gen team at Meta. Movie gen: A cast of media foundation models. 2024
work page 2024
-
[68]
Attention is all you need
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In NeurIPS, pp.\ 5998--6008, 2017
work page 2017
-
[69]
Diffusers: State-of-the-art diffusion models
von Platen, P., Patil, S., Lozhkov, A., Cuenca, P., Lambert, N., Rasul, K., Davaadorj, M., Nair, D., Paul, S., Berman, W., Xu, Y., Liu, S., and Wolf, T. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022
work page 2022
-
[70]
Wang, H., Du, X., Li, J., Yeh, R. A., and Shakhnarovich, G. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In IEEE CVPR , pp.\ 12619--12629, 2023 a
work page 2023
-
[71]
Pixel2mesh: Generating 3d mesh models from single RGB images
Wang, N., Zhang, Y., Li, Z., Fu, Y., Liu, W., and Jiang, Y. Pixel2mesh: Generating 3d mesh models from single RGB images. In Ferrari, V., Hebert, M., Sminchisescu, C., and Weiss, Y. (eds.), ECCV , volume 11215, pp.\ 55--71, 2018
work page 2018
-
[74]
Dual octree graph networks for learning adaptive volumetric shape representations
Wang, P., Liu, Y., and Tong, X. Dual octree graph networks for learning adaptive volumetric shape representations. ACM Transactions on Graphics (TOG), 41(4):103:1--103:15, 2022
work page 2022
-
[77]
Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation
Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., and Zhu, J. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. In NeurIPS, 2023 d
work page 2023
-
[80]
Multi-view mesh reconstruction with neural deferred shading
Worchel, M., Diaz, R., Hu, W., Schreer, O., Feldmann, I., and Eisert, P. Multi-view mesh reconstruction with neural deferred shading. In IEEE/CVF CVPR , pp.\ 6177--6187, 2022
work page 2022
-
[81]
Marrnet: 3d shape reconstruction via 2.5d sketches
Wu, J., Wang, Y., Xue, T., Sun, X., Freeman, B., and Tenenbaum, J. Marrnet: 3d shape reconstruction via 2.5d sketches. In NeurIPS, pp.\ 540--550, 2017
work page 2017
-
[82]
Wu, J. Z., Ge, Y., Wang, X., Lei, S. W., Gu, Y., Shi, Y., Hsu, W., Shan, Y., Qie, X., and Shou, M. Z. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In IEEE/CVF ICCV , pp.\ 7589--7599, 2023
work page 2023
-
[83]
Wu, K., Liu, F., Cai, Z., Yan, R., Wang, H., Hu, Y., Duan, Y., and Ma, K. Unique3d: High-quality and efficient 3d mesh generation from a single image. CoRR, abs/2405.20343, 2024 a
-
[84]
PQ-NET: A generative part seq2seq network for 3d shapes
Wu, R., Zhuang, Y., Xu, K., Zhang, H., and Chen, B. PQ-NET: A generative part seq2seq network for 3d shapes. In IEEE/CVF CVPR , pp.\ 826--835, 2020
work page 2020
-
[86]
Wu, T., Yang, G., Li, Z., Zhang, K., Liu, Z., Guibas, L. J., Lin, D., and Wetzstein, G. Gpt-4v(ision) is a human-aligned evaluator for text-to-3d generation. In IEEE/CVF CVPR , pp.\ 22227--22238, 2024 c
work page 2024
-
[88]
DISN: deep implicit surface network for high-quality single-view 3d reconstruction
Xu, Q., Wang, W., Ceylan, D., Mech, R., and Neumann, U. DISN: deep implicit surface network for high-quality single-view 3d reconstruction. In NeurIPS, pp.\ 490--500, 2019
work page 2019
-
[90]
Gaussiandreamer: Fast generation from text to 3d gaussians by bridging 2d and 3d diffusion models
Yi, T., Fang, J., Wang, J., Wu, G., Xie, L., Zhang, X., Liu, W., Tian, Q., and Wang, X. Gaussiandreamer: Fast generation from text to 3d gaussians by bridging 2d and 3d diffusion models. In IEEE CVPR, pp.\ 6796--6807, 2024
work page 2024
-
[91]
pixelnerf: Neural radiance fields from one or few images
Yu, A., Ye, V., Tancik, M., and Kanazawa, A. pixelnerf: Neural radiance fields from one or few images. In IEEE/CVF CVPR , pp.\ 4578--4587, 2021
work page 2021
-
[92]
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13, pp.\ 818--833. Springer, 2014
work page 2014
-
[93]
LION: latent point diffusion models for 3d shape generation
Zeng, X., Vahdat, A., Williams, F., Gojcic, Z., Litany, O., Fidler, S., and Kreis, K. LION: latent point diffusion models for 3d shape generation. In NeurIPS, 2022
work page 2022
-
[94]
Zhang, B. and Sennrich, R. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019
work page 2019
-
[95]
3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models
Zhang, B., Tang, J., Niessner, M., and Wonka, P. 3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models. ACM Transactions on Graphics (TOG), 42(4):1--16, 2023
work page 2023
-
[97]
Clay: A controllable large-scale generative model for creating high-quality 3d assets
Zhang, L., Wang, Z., Zhang, Q., Qiu, Q., Pang, A., Jiang, H., Yang, W., Xu, L., and Yu, J. Clay: A controllable large-scale generative model for creating high-quality 3d assets. ACM Transactions on Graphics (TOG), 43(4):1--20, 2024 b
work page 2024
-
[98]
Zhao, Z., Liu, W., Chen, X., Zeng, X., Wang, R., Cheng, P., Fu, B., Chen, T., Yu, G., and Gao, S. Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation. Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[99]
Locally attentional SDF diffusion for controllable 3d shape generation
Zheng, X., Pan, H., Wang, P., Tong, X., Liu, Y., and Shum, H. Locally attentional SDF diffusion for controllable 3d shape generation. ACM Trans. Graph., 42(4):91:1--91:13, 2023
work page 2023
-
[100]
3d shape generation and completion through point-voxel diffusion
Zhou, L., Du, Y., and Wu, J. 3d shape generation and completion through point-voxel diffusion. In IEEE/CVF ICCV , pp.\ 5806--5815, 2021
work page 2021
-
[101]
Sparse3d: Distilling multiview-consistent diffusion for object reconstruction from sparse views
Zou, Z., Cheng, W., Cao, Y., Huang, S., Shan, Y., and Zhang, S. Sparse3d: Distilling multiview-consistent diffusion for object reconstruction from sparse views. In AAAI , pp.\ 7900--7908, 2024 a
work page 2024
-
[102]
Zou, Z.-X., Yu, Z., Guo, Y.-C., Li, Y., Liang, D., Cao, Y.-P., and Zhang, S.-H. Triplane meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 10324--10335, 2024 b
work page 2024