pith. machine review for the scientific record.

arxiv: 2502.06608 · v3 · submitted 2025-02-10 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 21:45 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords 3D shape generation · rectified flow · diffusion models · image-to-3D · 3D VAE · high-fidelity meshes · generative modeling · data scaling

The pith

TripoSG generates high-fidelity 3D meshes from images via a large-scale rectified flow transformer trained on two million samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TripoSG as a streamlined diffusion-based system for creating 3D shapes that align closely with given images. It centers on a large-scale rectified flow transformer trained on two million high-quality 3D examples assembled by a custom data-processing pipeline. A 3D variational autoencoder is trained with a hybrid loss that combines signed distance function (SDF), normal, and eikonal terms to improve reconstruction accuracy. Experiments demonstrate that this combination produces shapes with finer surface detail, tighter image correspondence, and better handling of varied input styles than prior methods.
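
The hybrid VAE loss is concrete enough to sketch. Below is a minimal PyTorch rendering of the three-term supervision, assuming a decode_sdf(latent, points) interface and illustrative loss weights; none of these names or values come from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_geometry_loss(decode_sdf, latent, vol_pts, sdf_gt,
                         surf_pts, normals_gt,
                         w_sdf=1.0, w_normal=0.1, w_eik=0.01):
    """Sketch of SDF + normal + eikonal supervision. The decode_sdf
    interface and the weights are assumptions, not the paper's values."""
    # SDF term: regress ground-truth signed distances at volume samples.
    loss_sdf = F.l1_loss(decode_sdf(latent, vol_pts), sdf_gt)

    # Normal term: the SDF gradient at surface samples should align
    # with ground-truth surface normals.
    surf_pts = surf_pts.detach().requires_grad_(True)
    sdf_surf = decode_sdf(latent, surf_pts)
    grad = torch.autograd.grad(sdf_surf.sum(), surf_pts, create_graph=True)[0]
    loss_normal = (1.0 - F.cosine_similarity(grad, normals_gt, dim=-1)).mean()

    # Eikonal term: a valid SDF has unit-norm gradient everywhere.
    loss_eik = ((grad.norm(dim=-1) - 1.0) ** 2).mean()

    return w_sdf * loss_sdf + w_normal * loss_normal + w_eik * loss_eik
```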

Core claim

TripoSG achieves state-of-the-art performance in 3D shape generation through a large-scale rectified flow transformer trained on extensive high-quality data, paired with a hybrid supervised training strategy for the 3D VAE that combines SDF, normal, and eikonal losses, yielding meshes with enhanced detail, precise fidelity to input images, and strong generalization across diverse image styles and contents.

What carries the argument

Large-scale rectified flow transformer that directly models the mapping from noise to 3D shape representations conditioned on images.
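
Rectified flow trains the transformer to regress the constant velocity along straight noise-to-data paths, rather than the curved trajectories of standard diffusion. A minimal training step might look like the sketch below; the model(x_t, t, image_cond) signature is an assumed interface, not TripoSG's actual API.

```python
import torch

def rectified_flow_loss(model, shape_latents, image_cond):
    """One rectified-flow training step (Liu et al., 2023): regress the
    straight-line velocity from noise to data. The model signature is
    an illustrative assumption."""
    noise = torch.randn_like(shape_latents)
    t = torch.rand(shape_latents.shape[0], device=shape_latents.device)
    t_b = t.view(-1, *([1] * (shape_latents.dim() - 1)))  # broadcast over latent dims

    # Linear interpolant: pure noise at t = 0, clean latents at t = 1.
    x_t = (1.0 - t_b) * noise + t_b * shape_latents

    # Along a straight path the target velocity dx_t/dt is constant.
    velocity = shape_latents - noise
    pred = model(x_t, t, image_cond)
    return ((pred - velocity) ** 2).mean()
```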

If this is right

  • High-resolution 3D meshes can be produced directly from single images with greater surface detail than earlier diffusion approaches.
  • The model maintains alignment with input images even when those images vary widely in style and content.
  • Data scale and processing rules become central determinants of quality in 3D generative training.
  • Public release of the trained model enables direct use and further extension by the community.
  • Hybrid losses on SDF, normals, and eikonal terms improve the underlying 3D VAE reconstruction quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could support downstream tasks such as texture mapping or animation once the mesh output is further processed.
  • Scaling the same rectified-flow architecture to dynamic or multi-view 3D data might extend the fidelity gains to video or scene generation.
  • Integration with existing image-generation pipelines could allow end-to-end creation of textured 3D assets from text prompts alone.
  • If the data pipeline generalizes, similar scaling strategies may accelerate progress in other data-scarce 3D domains such as medical imaging or robotics.

Load-bearing premise

The custom data processing pipeline yields two million sufficiently high-quality, diverse, and unbiased 3D training samples that support claimed fidelity and generalization in real-world use.
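
That premise is at least checkable sample by sample. A hypothetical quality gate of the kind such a pipeline might apply, assuming a trimesh-style mesh object; the specific checks and thresholds are illustrative, not rules stated in the paper.

```python
def passes_quality_gate(mesh, min_faces=500):
    """Hypothetical per-sample filter (trimesh-style mesh assumed).
    Checks and thresholds are illustrative, not the paper's rules."""
    if not mesh.is_watertight:       # SDF supervision needs a closed surface
        return False
    if len(mesh.faces) < min_faces:  # too coarse to carry surface detail
        return False
    if mesh.volume <= 0.0:           # degenerate or inverted geometry
        return False
    return True
```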

What would settle it

Generated meshes from a held-out set of real-world photographs show systematic mismatches in fine surface details or fail to preserve image-specific features at the claimed resolution.
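
One concrete form of that test: sample point clouds from each generated mesh and its reference geometry, then compare them with a symmetric Chamfer distance. A minimal sketch follows; the sampling step and any pass/fail threshold are left to the evaluator.

```python
import torch

def chamfer_distance(pred_pts, gt_pts):
    """Symmetric Chamfer distance between an (N, 3) point set sampled
    from a generated mesh and an (M, 3) set from its reference mesh."""
    d2 = torch.cdist(pred_pts, gt_pts) ** 2  # (N, M) pairwise squared distances
    forward = d2.min(dim=1).values.mean()    # pred -> nearest gt
    backward = d2.min(dim=0).values.mean()   # gt -> nearest pred
    return forward + backward
```

A systematic rise in this score on fine-detail regions of held-out photographs, relative to the paper's comparisons, would be exactly the mismatch described above.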

read the original abstract

Recent advancements in diffusion techniques have propelled image and video generation to unprecedented levels of quality, significantly accelerating the deployment and application of generative AI. However, 3D shape generation technology has so far lagged behind, constrained by limitations in 3D data scale, complexity of 3D data processing, and insufficient exploration of advanced techniques in the 3D domain. Current approaches to 3D shape generation face substantial challenges in terms of output quality, generalization capability, and alignment with input conditions. We present TripoSG, a new streamlined shape diffusion paradigm capable of generating high-fidelity 3D meshes with precise correspondence to input images. Specifically, we propose: 1) A large-scale rectified flow transformer for 3D shape generation, achieving state-of-the-art fidelity through training on extensive, high-quality data. 2) A hybrid supervised training strategy combining SDF, normal, and eikonal losses for 3D VAE, achieving high-quality 3D reconstruction performance. 3) A data processing pipeline to generate 2 million high-quality 3D samples, highlighting the crucial rules for data quality and quantity in training 3D generative models. Through comprehensive experiments, we have validated the effectiveness of each component in our new framework. The seamless integration of these parts has enabled TripoSG to achieve state-of-the-art performance in 3D shape generation. The resulting 3D shapes exhibit enhanced detail due to high-resolution capabilities and demonstrate exceptional fidelity to input images. Moreover, TripoSG demonstrates improved versatility in generating 3D models from diverse image styles and contents, showcasing strong generalization capabilities. To foster progress and innovation in the field of 3D generation, we will make our model publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces TripoSG, a streamlined 3D shape diffusion framework that uses a large-scale rectified flow transformer trained on 2 million samples, a hybrid VAE with combined SDF/normal/eikonal losses, and a custom data processing pipeline. It claims state-of-the-art performance in high-fidelity mesh generation from images, with improved detail, input fidelity, and generalization across styles and contents, validated through comprehensive experiments and ablations.

Significance. If the empirical claims hold with proper validation, this work would advance 3D generative modeling by scaling rectified flow techniques to large data regimes and highlighting data quality rules, potentially improving applications in graphics, AR/VR, and content creation where current methods lag in fidelity and generalization.

major comments (2)
  1. [Data Processing Pipeline] The claim that the pipeline yields 2 million high-quality, representative 3D samples enabling SOTA performance lacks quantitative support, such as diversity statistics versus Objaverse, failure rates on edge cases, or artifact/bias metrics; without these, it is unclear whether reported gains stem from the rectified flow transformer or from data curation artifacts.
  2. [Experiments] The abstract and framework description assert SOTA results from comprehensive experiments and component ablations, yet no specific quantitative metrics, baseline comparisons, error bars, or tabled results are referenced to substantiate the fidelity and generalization claims, leaving the central performance assertion without verifiable grounding.
minor comments (1)
  1. [Abstract] Key numerical results (e.g., FID, CD, or user study scores) should be included to allow readers to immediately assess the SOTA claim.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and detailed review of our manuscript. We address each major comment point by point below and will revise the paper to incorporate additional quantitative details where appropriate.

read point-by-point responses
  1. Referee: [Data Processing Pipeline] The claim that the pipeline yields 2 million high-quality, representative 3D samples enabling SOTA performance lacks quantitative support, such as diversity statistics versus Objaverse, failure rates on edge cases, or artifact/bias metrics; without these, it is unclear whether reported gains stem from the rectified flow transformer or from data curation artifacts.

    Authors: We agree that additional quantitative validation of the data processing pipeline would strengthen the manuscript. In the revised version, we will expand the Data Processing Pipeline section to include diversity statistics (such as category distributions and geometric complexity metrics) compared against Objaverse, reported failure rates from the filtering stages, and a brief analysis of potential artifacts or biases. These additions will help clarify the contribution of the curated dataset relative to the model architecture. revision: yes

  2. Referee: [Experiments] The abstract and framework description assert SOTA results from comprehensive experiments and component ablations, yet no specific quantitative metrics, baseline comparisons, error bars, or tabled results are referenced to substantiate the fidelity and generalization claims, leaving the central performance assertion without verifiable grounding.

    Authors: We acknowledge that the abstract and framework overview would benefit from more explicit cross-references to the quantitative results. The experiments section already presents detailed metrics (including fidelity measures, baseline comparisons, and ablation studies), but we will revise the abstract to highlight key quantitative outcomes and add direct references to the relevant tables and figures in the framework description. Error bars will be added to applicable plots in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical training and a new data pipeline

full rationale

The paper presents TripoSG as a new model trained on 2 million samples from a custom data pipeline, using a rectified flow transformer and hybrid VAE losses (SDF + normal + eikonal). All central claims (SOTA fidelity, generalization) are supported by experimental validation on held-out data rather than any derivation that reduces to fitted inputs, self-definitions, or self-citation chains. The data pipeline is described as an input rather than derived from the model's outputs, and no equations or uniqueness theorems are invoked that loop back to the paper's own results. This is a standard empirical ML paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the transferability of rectified flow techniques to 3D and the sufficiency of the described losses and data rules, which are presented without detailed prior validation or external benchmarks in the abstract.

free parameters (1)
  • Training dataset scale
    The choice of 2 million high-quality 3D samples is presented as critical to achieving fidelity and is a key scale hyperparameter selected for the training process.
axioms (1)
  • domain assumption Rectified flow models transfer effectively to 3D shape generation when scaled with sufficient high-quality data
    The abstract assumes the technique, previously successful in image/video domains, will produce high-fidelity 3D meshes without fundamental domain-specific barriers.

pith-pipeline@v0.9.0 · 5662 in / 1448 out tokens · 51095 ms · 2026-05-16T21:45:37.289109+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

    cs.CR 2026-05 conditional novelty 8.0

    Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...

  2. Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

    cs.CV 2026-04 unverdicted novelty 7.0

    A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.

  3. Make-It-Poseable: Feed-forward Latent Posing Model for 3D Characters

    cs.CV 2025-12 unverdicted novelty 7.0

    A latent-space transformer framework poses 3D characters without skinning or fixed topologies, outperforming baselines and generalizing zero-shot to quadrupeds.

  4. Pixal3D: Pixel-Aligned 3D Generation from Images

    cs.CV 2026-05 unverdicted novelty 6.0

    Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.

  5. DVD: Discrete Voxel Diffusion for 3D Generation and Editing

    cs.CV 2026-05 unverdicted novelty 6.0

    DVD treats voxel occupancy as a discrete variable in a diffusion framework to generate, assess, and edit sparse 3D voxels without continuous thresholding.

  6. PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World

    cs.CV 2026-05 unverdicted novelty 6.0

    PhysForge generates physics-grounded 3D assets via a VLM-planned Hierarchical Physical Blueprint and a KineVoxel Injection diffusion model, backed by the new PhysDB dataset of 150,000 annotated assets.

  7. Velox: Learning Representations of 4D Geometry and Appearance

    cs.CV 2026-05 unverdicted novelty 6.0

    Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...

  8. 3D-ReGen: A Unified 3D Geometry Regeneration Framework

    cs.CV 2026-04 unverdicted novelty 6.0

    3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.

  9. Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data

    cs.CV 2026-04 unverdicted novelty 6.0

    BVE framework enables text-guided 3D editing beyond voxel limits by combining self-constructed data, lightweight semantic injection, and annotation-free masking to preserve local invariance.

  10. ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

    cs.CV 2026-04 unverdicted novelty 6.0

    ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.

  11. UniRecGen: Unifying Multi-View 3D Reconstruction and Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    UniRecGen unifies reconstruction and generation via shared canonical space and disentangled cooperative learning to produce complete, consistent 3D models from sparse views.

  12. SegviGen: Repurposing 3D Generative Model for Part Segmentation

    cs.CV 2026-03 unverdicted novelty 6.0

    SegviGen shows pretrained 3D generative models can be repurposed for part segmentation via voxel colorization, beating prior methods by 40% interactively and 15% on full segmentation using only 0.32% of labeled data.

  13. MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation

    cs.CV 2026-03 unverdicted novelty 6.0

    MV-SAM3D adds multi-view fusion via multi-diffusion with attention-entropy and visibility weighting plus physics-aware optimization to improve fidelity and physical plausibility in layout-aware 3D generation.

  14. Learn2Fold: Structured Origami Generation with World Model Planning

    cs.GR 2026-02 unverdicted novelty 6.0

    Learn2Fold generates physically valid origami folding sequences from text prompts by decoupling LLM-based program proposals from verification in a learned graph-structured world model.

  15. Pose-Aware Diffusion for 3D Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.

  16. From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation

    cs.GR 2026-04 unverdicted novelty 5.0

    The paper surveys 3D asset generation methods and organizes them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.

  17. AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation

    cs.CV 2026-04 unverdicted novelty 4.0

    AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantica...

  18. From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation

    cs.GR 2026-04 unverdicted novelty 4.0

    The paper surveys 3D content generation literature using a taxonomy of asset types and production stages to evaluate progress toward engine-ready assets.

  19. Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation

    cs.GR 2026-04 unverdicted novelty 4.0

    Seed3D 2.0 advances 3D content generation via a coarse-to-fine geometry pipeline, unified PBR material model, and simulation-ready scene tools, reporting 69-89.9% win rates over commercial systems in human studies.
