pith. machine review for the scientific record.

arxiv: 2502.06608 · v3 · submitted 2025-02-10 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 21:45 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords 3D shape generation · rectified flow · diffusion models · image-to-3D · 3D VAE · high-fidelity meshes · generative modeling · data scaling

The pith

TripoSG generates high-fidelity 3D meshes from images via a large-scale rectified flow transformer trained on two million samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TripoSG as a streamlined diffusion-based system for creating 3D shapes that align closely with given images. It centers on a large-scale rectified flow transformer trained on two million high-quality 3D examples assembled by a custom data-processing pipeline. A 3D variational autoencoder is trained with a hybrid loss that combines signed distance function (SDF), normal, and eikonal terms to improve reconstruction accuracy. Experiments demonstrate that this combination produces shapes with finer surface detail, tighter image correspondence, and better handling of varied input styles than prior methods.
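
The hybrid VAE loss is concrete enough to sketch. Below is a minimal PyTorch rendering of the three-term supervision, assuming a decode_sdf(latent, points) interface and illustrative loss weights; none of these names or values come from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_geometry_loss(decode_sdf, latent, vol_pts, sdf_gt,
                         surf_pts, normals_gt,
                         w_sdf=1.0, w_normal=0.1, w_eik=0.01):
    """Sketch of SDF + normal + eikonal supervision. The decode_sdf
    interface and the weights are assumptions, not the paper's values."""
    # SDF term: regress ground-truth signed distances at volume samples.
    loss_sdf = F.l1_loss(decode_sdf(latent, vol_pts), sdf_gt)

    # Normal term: the SDF gradient at surface samples should align
    # with ground-truth surface normals.
    surf_pts = surf_pts.detach().requires_grad_(True)
    sdf_surf = decode_sdf(latent, surf_pts)
    grad = torch.autograd.grad(sdf_surf.sum(), surf_pts, create_graph=True)[0]
    loss_normal = (1.0 - F.cosine_similarity(grad, normals_gt, dim=-1)).mean()

    # Eikonal term: a valid SDF has unit-norm gradient everywhere.
    loss_eik = ((grad.norm(dim=-1) - 1.0) ** 2).mean()

    return w_sdf * loss_sdf + w_normal * loss_normal + w_eik * loss_eik
```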

Core claim

TripoSG achieves state-of-the-art performance in 3D shape generation through a large-scale rectified flow transformer trained on extensive high-quality data, paired with a hybrid supervised training strategy for the 3D VAE that combines SDF, normal, and eikonal losses, yielding meshes with enhanced detail, precise fidelity to input images, and strong generalization across diverse image styles and contents.

What carries the argument

Large-scale rectified flow transformer that directly models the mapping from noise to 3D shape representations conditioned on images.
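
Rectified flow trains the transformer to regress the constant velocity along straight noise-to-data paths, rather than the curved trajectories of standard diffusion. A minimal training step might look like the sketch below; the model(x_t, t, image_cond) signature is an assumed interface, not TripoSG's actual API.

```python
import torch

def rectified_flow_loss(model, shape_latents, image_cond):
    """One rectified-flow training step (Liu et al., 2023): regress the
    straight-line velocity from noise to data. The model signature is
    an illustrative assumption."""
    noise = torch.randn_like(shape_latents)
    t = torch.rand(shape_latents.shape[0], device=shape_latents.device)
    t_b = t.view(-1, *([1] * (shape_latents.dim() - 1)))  # broadcast over latent dims

    # Linear interpolant: pure noise at t = 0, clean latents at t = 1.
    x_t = (1.0 - t_b) * noise + t_b * shape_latents

    # Along a straight path the target velocity dx_t/dt is constant.
    velocity = shape_latents - noise
    pred = model(x_t, t, image_cond)
    return ((pred - velocity) ** 2).mean()
```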

If this is right

  • High-resolution 3D meshes can be produced directly from single images with greater surface detail than earlier diffusion approaches.
  • The model maintains alignment with input images even when those images vary widely in style and content.
  • Data scale and processing rules become central determinants of quality in 3D generative training.
  • Public release of the trained model enables direct use and further extension by the community.
  • Hybrid losses on SDF, normals, and eikonal terms improve the underlying 3D VAE reconstruction quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could support downstream tasks such as texture mapping or animation once the mesh output is further processed.
  • Scaling the same rectified-flow architecture to dynamic or multi-view 3D data might extend the fidelity gains to video or scene generation.
  • Integration with existing image-generation pipelines could allow end-to-end creation of textured 3D assets from text prompts alone.
  • If the data pipeline generalizes, similar scaling strategies may accelerate progress in other data-scarce 3D domains such as medical imaging or robotics.

Load-bearing premise

The custom data processing pipeline yields two million sufficiently high-quality, diverse, and unbiased 3D training samples that support claimed fidelity and generalization in real-world use.
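
That premise is at least checkable sample by sample. A hypothetical quality gate of the kind such a pipeline might apply, assuming a trimesh-style mesh object; the specific checks and thresholds are illustrative, not rules stated in the paper.

```python
def passes_quality_gate(mesh, min_faces=500):
    """Hypothetical per-sample filter (trimesh-style mesh assumed).
    Checks and thresholds are illustrative, not the paper's rules."""
    if not mesh.is_watertight:       # SDF supervision needs a closed surface
        return False
    if len(mesh.faces) < min_faces:  # too coarse to carry surface detail
        return False
    if mesh.volume <= 0.0:           # degenerate or inverted geometry
        return False
    return True
```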

What would settle it

Generated meshes from a held-out set of real-world photographs show systematic mismatches in fine surface details or fail to preserve image-specific features at the claimed resolution.
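
One concrete form of that test: sample point clouds from each generated mesh and its reference geometry, then compare them with a symmetric Chamfer distance. A minimal sketch follows; the sampling step and any pass/fail threshold are left to the evaluator.

```python
import torch

def chamfer_distance(pred_pts, gt_pts):
    """Symmetric Chamfer distance between an (N, 3) point set sampled
    from a generated mesh and an (M, 3) set from its reference mesh."""
    d2 = torch.cdist(pred_pts, gt_pts) ** 2  # (N, M) pairwise squared distances
    forward = d2.min(dim=1).values.mean()    # pred -> nearest gt
    backward = d2.min(dim=0).values.mean()   # gt -> nearest pred
    return forward + backward
```

A systematic rise in this score on fine-detail regions of held-out photographs, relative to the paper's comparisons, would be exactly the mismatch described above.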

read the original abstract

Recent advancements in diffusion techniques have propelled image and video generation to unprecedented levels of quality, significantly accelerating the deployment and application of generative AI. However, 3D shape generation technology has so far lagged behind, constrained by limitations in 3D data scale, complexity of 3D data processing, and insufficient exploration of advanced techniques in the 3D domain. Current approaches to 3D shape generation face substantial challenges in terms of output quality, generalization capability, and alignment with input conditions. We present TripoSG, a new streamlined shape diffusion paradigm capable of generating high-fidelity 3D meshes with precise correspondence to input images. Specifically, we propose: 1) A large-scale rectified flow transformer for 3D shape generation, achieving state-of-the-art fidelity through training on extensive, high-quality data. 2) A hybrid supervised training strategy combining SDF, normal, and eikonal losses for 3D VAE, achieving high-quality 3D reconstruction performance. 3) A data processing pipeline to generate 2 million high-quality 3D samples, highlighting the crucial rules for data quality and quantity in training 3D generative models. Through comprehensive experiments, we have validated the effectiveness of each component in our new framework. The seamless integration of these parts has enabled TripoSG to achieve state-of-the-art performance in 3D shape generation. The resulting 3D shapes exhibit enhanced detail due to high-resolution capabilities and demonstrate exceptional fidelity to input images. Moreover, TripoSG demonstrates improved versatility in generating 3D models from diverse image styles and contents, showcasing strong generalization capabilities. To foster progress and innovation in the field of 3D generation, we will make our model publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces TripoSG, a streamlined 3D shape diffusion framework that uses a large-scale rectified flow transformer trained on 2 million samples, a hybrid VAE with combined SDF/normal/eikonal losses, and a custom data processing pipeline. It claims state-of-the-art performance in high-fidelity mesh generation from images, with improved detail, input fidelity, and generalization across styles and contents, validated through comprehensive experiments and ablations.

Significance. If the empirical claims hold with proper validation, this work would advance 3D generative modeling by scaling rectified flow techniques to large data regimes and highlighting data quality rules, potentially improving applications in graphics, AR/VR, and content creation where current methods lag in fidelity and generalization.

major comments (2)
  1. [Data Processing Pipeline] The claim that the pipeline yields 2 million high-quality, representative 3D samples enabling SOTA performance lacks quantitative support, such as diversity statistics versus Objaverse, failure rates on edge cases, or artifact/bias metrics; without these, it is unclear whether reported gains stem from the rectified flow transformer or from data curation artifacts.
  2. [Experiments] The abstract and framework description assert SOTA results from comprehensive experiments and component ablations, yet no specific quantitative metrics, baseline comparisons, error bars, or tabled results are referenced to substantiate the fidelity and generalization claims, leaving the central performance assertion without verifiable grounding.
minor comments (1)
  1. [Abstract] Key numerical results (e.g., FID, CD, or user study scores) should be included to allow readers to immediately assess the SOTA claim.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and detailed review of our manuscript. We address each major comment point by point below and will revise the paper to incorporate additional quantitative details where appropriate.

read point-by-point responses
  1. Referee: [Data Processing Pipeline] The claim that the pipeline yields 2 million high-quality, representative 3D samples enabling SOTA performance lacks quantitative support, such as diversity statistics versus Objaverse, failure rates on edge cases, or artifact/bias metrics; without these, it is unclear whether reported gains stem from the rectified flow transformer or from data curation artifacts.

    Authors: We agree that additional quantitative validation of the data processing pipeline would strengthen the manuscript. In the revised version, we will expand the Data Processing Pipeline section to include diversity statistics (such as category distributions and geometric complexity metrics) compared against Objaverse, reported failure rates from the filtering stages, and a brief analysis of potential artifacts or biases. These additions will help clarify the contribution of the curated dataset relative to the model architecture. revision: yes

  2. Referee: [Experiments] The abstract and framework description assert SOTA results from comprehensive experiments and component ablations, yet no specific quantitative metrics, baseline comparisons, error bars, or tabled results are referenced to substantiate the fidelity and generalization claims, leaving the central performance assertion without verifiable grounding.

    Authors: We acknowledge that the abstract and framework overview would benefit from more explicit cross-references to the quantitative results. The experiments section already presents detailed metrics (including fidelity measures, baseline comparisons, and ablation studies), but we will revise the abstract to highlight key quantitative outcomes and add direct references to the relevant tables and figures in the framework description. Error bars will be added to applicable plots in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical training and a new data pipeline

full rationale

The paper presents TripoSG as a new model trained on 2 million samples from a custom data pipeline, using a rectified flow transformer and hybrid VAE losses (SDF + normal + eikonal). All central claims (SOTA fidelity, generalization) are supported by experimental validation on held-out data rather than any derivation that reduces to fitted inputs, self-definitions, or self-citation chains. The data pipeline is described as an input rather than derived from the model's outputs, and no equations or uniqueness theorems are invoked that loop back to the paper's own results. This is a standard empirical ML paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the transferability of rectified flow techniques to 3D and the sufficiency of the described losses and data rules, which are presented without detailed prior validation or external benchmarks in the abstract.

free parameters (1)
  • Training dataset scale
    The choice of 2 million high-quality 3D samples is presented as critical to achieving fidelity and is a key scale hyperparameter selected for the training process.
axioms (1)
  • domain assumption Rectified flow models transfer effectively to 3D shape generation when scaled with sufficient high-quality data
    The abstract assumes the technique, previously successful in image/video domains, will produce high-fidelity 3D meshes without fundamental domain-specific barriers.

pith-pipeline@v0.9.0 · 5662 in / 1448 out tokens · 51095 ms · 2026-05-16T21:45:37.289109+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

    cs.CR 2026-05 conditional novelty 8.0

    Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...

  2. Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

    cs.CV 2026-04 unverdicted novelty 7.0

    A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.

  3. Make-It-Poseable: Feed-forward Latent Posing Model for 3D Characters

    cs.CV 2025-12 unverdicted novelty 7.0

    A latent-space transformer framework poses 3D characters without skinning or fixed topologies, outperforming baselines and generalizing zero-shot to quadrupeds.

  4. Pixal3D: Pixel-Aligned 3D Generation from Images

    cs.CV 2026-05 unverdicted novelty 6.0

    Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.

  5. DVD: Discrete Voxel Diffusion for 3D Generation and Editing

    cs.CV 2026-05 unverdicted novelty 6.0

    DVD treats voxel occupancy as a discrete variable in a diffusion framework to generate, assess, and edit sparse 3D voxels without continuous thresholding.

  6. PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World

    cs.CV 2026-05 unverdicted novelty 6.0

    PhysForge generates physics-grounded 3D assets via a VLM-planned Hierarchical Physical Blueprint and a KineVoxel Injection diffusion model, backed by the new PhysDB dataset of 150,000 annotated assets.

  7. Velox: Learning Representations of 4D Geometry and Appearance

    cs.CV 2026-05 unverdicted novelty 6.0

    Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...

  8. 3D-ReGen: A Unified 3D Geometry Regeneration Framework

    cs.CV 2026-04 unverdicted novelty 6.0

    3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.

  9. Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data

    cs.CV 2026-04 unverdicted novelty 6.0

    BVE framework enables text-guided 3D editing beyond voxel limits by combining self-constructed data, lightweight semantic injection, and annotation-free masking to preserve local invariance.

  10. ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

    cs.CV 2026-04 unverdicted novelty 6.0

    ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.

  11. UniRecGen: Unifying Multi-View 3D Reconstruction and Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    UniRecGen unifies reconstruction and generation via shared canonical space and disentangled cooperative learning to produce complete, consistent 3D models from sparse views.

  12. SegviGen: Repurposing 3D Generative Model for Part Segmentation

    cs.CV 2026-03 unverdicted novelty 6.0

    SegviGen shows pretrained 3D generative models can be repurposed for part segmentation via voxel colorization, beating prior methods by 40% interactively and 15% on full segmentation using only 0.32% of labeled data.

  13. MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation

    cs.CV 2026-03 unverdicted novelty 6.0

    MV-SAM3D adds multi-view fusion via multi-diffusion with attention-entropy and visibility weighting plus physics-aware optimization to improve fidelity and physical plausibility in layout-aware 3D generation.

  14. Learn2Fold: Structured Origami Generation with World Model Planning

    cs.GR 2026-02 unverdicted novelty 6.0

    Learn2Fold generates physically valid origami folding sequences from text prompts by decoupling LLM-based program proposals from verification in a learned graph-structured world model.

  15. Pose-Aware Diffusion for 3D Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.

  16. From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation

    cs.GR 2026-04 unverdicted novelty 5.0

    The paper surveys 3D asset generation methods and organizes them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.

  17. AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation

    cs.CV 2026-04 unverdicted novelty 4.0

    AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantica...

  18. From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation

    cs.GR 2026-04 unverdicted novelty 4.0

    The paper surveys 3D content generation literature using a taxonomy of asset types and production stages to evaluate progress toward engine-ready assets.

  19. Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation

    cs.GR 2026-04 unverdicted novelty 4.0

    Seed3D 2.0 advances 3D content generation via a coarse-to-fine geometry pipeline, unified PBR material model, and simulation-ready scene tools, reporting 69-89.9% win rates over commercial systems in human studies.
