pith. machine review for the scientific record.

arxiv: 2305.02463 · v1 · submitted 2023-05-03 · 💻 cs.CV · cs.LG

Recognition: 2 theorem links

Shap-E: Generating Conditional 3D Implicit Functions

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 15:27 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords text-to-3D · implicit functions · diffusion models · neural radiance fields · 3D generation · conditional generative models · meshes

The pith

Shap-E generates parameters of implicit functions for 3D assets directly from text prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Shap-E as a conditional generative model that outputs the parameters of implicit functions rather than a single fixed representation. These parameters support rendering as both textured meshes and neural radiance fields. Training proceeds in two stages: an encoder first maps existing 3D assets deterministically into the implicit-function parameter space, after which a conditional diffusion model learns to generate new parameter sets from text or other conditions. The resulting system produces complex and diverse 3D assets in seconds when trained on large paired text-3D datasets and matches or exceeds the quality of prior explicit point-cloud generators while handling a higher-dimensional output space.
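To make the two-stage recipe concrete, here is a minimal sketch in PyTorch under simplifying assumptions: the asset is reduced to a colored point cloud, the "implicit-function parameters" to a single latent vector, and the denoiser to an MLP. The names `AssetEncoder`, `LatentDenoiser`, and `diffusion_loss` are ours for illustration and do not mirror the released shap-e codebase.

```python
# Hypothetical sketch of the two-stage pipeline, not the released shap-e API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AssetEncoder(nn.Module):
    # Stage 1: deterministic map from a 3D asset (simplified here to a
    # colored point cloud, xyz + rgb) to a latent standing in for the
    # implicit-function parameters.
    def __init__(self, latent_dim: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(6, 256), nn.GELU(),
                                 nn.Linear(256, latent_dim))

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 6) -> pooled latent: (B, latent_dim)
        return self.mlp(points).mean(dim=1)

class LatentDenoiser(nn.Module):
    # Stage 2: epsilon-predicting denoiser over encoder latents,
    # conditioned on a text embedding (e.g. from CLIP).
    def __init__(self, latent_dim: int = 1024, cond_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + cond_dim + 1, 2048),
                                 nn.GELU(), nn.Linear(2048, latent_dim))

    def forward(self, z_t, t, cond):
        # z_t: (B, D); t: (B, 1) normalized timestep; cond: (B, cond_dim)
        return self.net(torch.cat([z_t, t, cond], dim=-1))

def diffusion_loss(denoiser, z0, cond, alpha_bar):
    # Standard DDPM objective run entirely in the latent space: noise a
    # clean latent z0 to z_t, then regress the injected noise.
    B, T = z0.shape[0], alpha_bar.shape[0]
    t = torch.randint(0, T, (B,))
    a = alpha_bar[t].unsqueeze(-1)                      # (B, 1)
    eps = torch.randn_like(z0)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps
    return F.mse_loss(denoiser(z_t, t.float().unsqueeze(-1) / T, cond), eps)

# Example wiring (untrained, random data; placeholder schedule):
enc, den = AssetEncoder(), LatentDenoiser()
alpha_bar = torch.linspace(0.9999, 0.02, 1000)
loss = diffusion_loss(den, enc(torch.rand(2, 512, 6)),
                      torch.rand(2, 512), alpha_bar)
```

The point the sketch makes is structural: stage 2 never touches geometry directly; it learns a distribution over the outputs of stage 1.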

Core claim

Shap-E trains an encoder that deterministically maps 3D assets into the parameters of an implicit function, then trains a conditional diffusion model on those encoded outputs so that text conditions can produce new implicit-function parameters that render as both textured meshes and neural radiance fields.

What carries the argument

A two-stage pipeline consisting of a deterministic encoder that converts 3D assets into implicit-function parameters, followed by a conditional diffusion model that generates new parameters in that space.

If this is right

  • A single trained model produces both textured meshes and neural radiance fields from the same implicit-function parameters (sketched after this list).
  • Generation completes in seconds once the model is trained on paired 3D and text data.
  • Sample quality reaches or exceeds that of explicit point-cloud generators while operating in a higher-dimensional multi-representation space.
  • The diffusion process runs entirely in the compressed parameter space produced by the encoder.
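A sketch of why the first and fourth bullets can hold at once: the generated parameters define one coordinate MLP whose heads serve both render paths, with meshes extracted from the signed-distance head by marching cubes. The architecture below is illustrative, not the paper's, and the NeRF volume-rendering integral is omitted.

```python
# Illustrative dual-headed field; assumes scikit-image for isosurfacing.
import torch
import torch.nn as nn
from skimage.measure import marching_cubes

class ImplicitField(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(3, hidden), nn.GELU(),
                                   nn.Linear(hidden, hidden), nn.GELU())
        self.nerf_head = nn.Linear(hidden, 4)  # density + RGB for ray marching
        self.stf_head = nn.Linear(hidden, 4)   # signed distance + texture RGB

    def forward(self, xyz: torch.Tensor):
        h = self.trunk(xyz)
        return self.nerf_head(h), self.stf_head(h)

field = ImplicitField()

# NeRF path: query density/colour at points along camera rays
# (the volume-rendering integral itself is omitted here).
nerf_out, _ = field(torch.rand(4096, 3) * 2 - 1)

# Mesh path: evaluate the signed distance on a grid, then marching cubes.
n = 32
axes = [torch.linspace(-1, 1, n)] * 3
grid = torch.stack(torch.meshgrid(*axes, indexing="ij"), dim=-1).reshape(-1, 3)
with torch.no_grad():
    _, stf_out = field(grid)
sdf = stf_out[:, 0].reshape(n, n, n).numpy()
# A trained model would use level=0.0; the mean keeps this untrained
# sketch from requesting an iso-level outside the field's range.
verts, faces, _, _ = marching_cubes(sdf, level=float(sdf.mean()))
```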

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same parameter space might support conditioning on images or sketches in addition to text without retraining the encoder.
  • Because the output is an implicit function, downstream tasks such as collision detection or ray-tracing could operate directly on the generated representation.
  • Scaling the encoder to larger or more diverse 3D datasets could reduce the information loss that currently limits the diffusion stage.

Load-bearing premise

The encoder can map arbitrary 3D assets into implicit-function parameters with negligible information loss so the diffusion model sees a faithful latent space.

What would settle it

Observe whether text-conditioned outputs consistently produce 3D assets whose rendered views match the prompt description and exhibit visual diversity across repeated samples.
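One way to operationalize that check, under assumptions the paper does not pin down: render several views per generated asset (the rendering step is assumed, not shown), score prompt match with CLIP [48], and measure diversity as mean pairwise embedding distance across repeated samples from the same prompt.

```python
# Hypothetical evaluation harness; the render step is assumed elsewhere.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def prompt_match(views: list[Image.Image], prompt: str) -> float:
    # Mean CLIP image-text similarity over rendered views of one sample.
    inputs = processor(text=[prompt], images=views,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image.mean().item()

def sample_diversity(view_sets: list[list[Image.Image]]) -> float:
    # Mean pairwise cosine distance between per-sample CLIP embeddings;
    # higher means more visual diversity across samplings (needs >= 2 sets).
    embs = []
    for views in view_sets:
        inputs = processor(images=views, return_tensors="pt")
        with torch.no_grad():
            e = model.get_image_features(**inputs).mean(dim=0)
        embs.append(e / e.norm())
    E = torch.stack(embs)                 # (S, D)
    cos = E @ E.T
    S = cos.shape[0]
    return (1 - cos[~torch.eye(S, dtype=torch.bool)]).mean().item()
```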

read the original abstract

We present Shap-E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produce a single output representation, Shap-E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. We train Shap-E in two stages: first, we train an encoder that deterministically maps 3D assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. When trained on a large dataset of paired 3D and text data, our resulting models are capable of generating complex and diverse 3D assets in a matter of seconds. When compared to Point-E, an explicit generative model over point clouds, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space. We release model weights, inference code, and samples at https://github.com/openai/shap-e.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper presents Shap-E, a conditional generative model for 3D assets that directly outputs parameters of implicit functions renderable as both textured meshes and neural radiance fields. Training proceeds in two stages: an encoder deterministically maps 3D assets to implicit-function parameters, after which a conditional diffusion model is trained on the encoder outputs using paired 3D-text data. The resulting models generate complex, diverse 3D assets in seconds and are reported to converge faster while achieving comparable or better sample quality than the Point-E baseline despite operating in a higher-dimensional, multi-representation output space. Model weights, inference code, and samples are released.

Significance. If the empirical claims hold, the work provides a practical advance in 3D generative modeling by producing implicit representations that support multiple downstream renderings. The two-stage architecture with publicly released implementation enables direct empirical verification of reconstruction fidelity and generation quality, strengthening reproducibility.

minor comments (3)
  1. [Abstract] The phrase 'a large dataset of paired 3D and text data' would be strengthened by reporting the approximate number of assets or total training tokens, to allow readers to gauge scale.
  2. [Results] The comparison tables would benefit from an additional column or row reporting reconstruction PSNR or IoU for the encoder stage alone, to quantify information loss before the diffusion stage; a minimal version of this check is sketched after this list.
  3. [Method] The implicit-function parameterization (Eq. 1 or equivalent) could include an explicit statement of the number of output channels or basis functions used for texture, to clarify the dimensionality increase relative to Point-E.
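A minimal version of the check requested in minor comment 2, assuming the ground-truth asset and its encoder-decoder reconstruction can be rendered to aligned views or voxelized into occupancy grids (those helpers are not shown):

```python
# Encoder-only fidelity metrics; rendering/voxelization assumed elsewhere.
import torch

def psnr(img_a: torch.Tensor, img_b: torch.Tensor, max_val: float = 1.0) -> float:
    # Peak signal-to-noise ratio between two aligned renders in [0, max_val].
    mse = torch.mean((img_a - img_b) ** 2)
    return float(10 * torch.log10(max_val ** 2 / mse))

def voxel_iou(occ_a: torch.Tensor, occ_b: torch.Tensor) -> float:
    # Intersection-over-union of boolean occupancy grids.
    inter = (occ_a & occ_b).sum()
    union = (occ_a | occ_b).sum()
    return float(inter) / float(union)
```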

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their accurate summary of Shap-E, for highlighting the significance of the two-stage training procedure and multi-representation output, and for the constructive minor comments. We are pleased that the practical advantages and reproducibility aspects were noted.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical two-stage pipeline: an encoder that maps 3D assets to implicit-function parameters, followed by a conditional diffusion model trained on those encoder outputs using external paired 3D-text data. Performance claims rest on direct training and comparison against the independently published Point-E baseline rather than any internal equation that reduces a reported metric to a quantity defined by the authors' own fitted constants or self-citation chain. No load-bearing step matches the enumerated circularity patterns; the derivation is self-contained against external benchmarks and released code.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the standard assumption that implicit functions can faithfully represent the training assets and that diffusion models can learn the distribution of their parameters; no new physical entities or ad-hoc constants are introduced beyond ordinary neural-network hyperparameters.

free parameters (1)
  • diffusion model training hyperparameters
    Standard learning-rate, noise schedule, and architecture choices fitted during optimization on the encoder outputs.
axioms (1)
  • domain assumption: An encoder can map 3D assets into implicit-function parameters with negligible loss of geometric and textural information
    Invoked in the first training stage described in the abstract.
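For concreteness, one standard instance of the noise-schedule free parameter is the cosine ᾱ schedule of Nichol and Dhariwal [41]; whether Shap-E uses this exact schedule is not stated in the material above.

```python
import math

def cosine_alpha_bar(T: int = 1000, s: float = 0.008) -> list[float]:
    # Cosine schedule from Nichol & Dhariwal [41]: alpha_bar(t) is the
    # fraction of signal variance surviving at diffusion step t.
    f = lambda t: math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(T + 1)]
```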

pith-pipeline@v0.9.0 · 5457 in / 1233 out tokens · 28207 ms · 2026-05-16T15:27:46.007085+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ReConText3D: Replay-based Continual Text-to-3D Generation

    cs.CV 2026-04 conditional novelty 8.0

    ReConText3D is the first replay-memory framework for continual text-to-3D generation that prevents catastrophic forgetting on new textual categories while preserving quality on previously seen classes.

  2. Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion

    cs.CV 2026-04 unverdicted novelty 7.0

    3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.

  3. VecSet-Edit: Unleashing Pre-trained LRM for Mesh Editing from Single Image

    cs.CV 2026-02 unverdicted novelty 7.0

    VecSet-Edit is the first method to perform high-fidelity mesh editing from a single image by analyzing and manipulating spatial token subsets in a pre-trained VecSet LRM.

  4. Affostruction: 3D Affordance Grounding with Generative Reconstruction

    cs.CV 2026-01 unverdicted novelty 7.0

    Affostruction reconstructs full 3D object geometry from partial RGBD views and grounds text-based affordances on both visible and unobserved surfaces, reporting large gains over prior methods.

  5. Structured 3D Latents for Scalable and Versatile 3D Generation

    cs.CV 2024-12 unverdicted novelty 7.0

    SLAT provides a unified 3D latent representation enabling versatile high-quality generation across multiple output formats from text or image inputs.

  6. LRM: Large Reconstruction Model for Single Image to 3D

    cs.CV 2023-11 conditional novelty 7.0

    LRM is a large transformer that predicts a NeRF directly from a single image after training on a million-object multi-view dataset.

  7. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

    cs.CV 2023-09 unverdicted novelty 7.0

    DreamGaussian creates high-quality textured 3D meshes from single-view images in 2 minutes via generative Gaussian Splatting with mesh extraction and UV refinement.

  8. REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement

    cs.CV 2026-04 unverdicted novelty 6.0

    REVIVE 3D generates voluminous 3D assets from flat 2D images via an inflated prior construction followed by latent-space refinement, plus new metrics for volume and flatness validated by user study.

  9. Point-MF: One-step Point Cloud Generation from a Single Image via Mean Flows

    cs.CV 2026-04 unverdicted novelty 6.0

    Point-MF performs one-step point cloud reconstruction from single images by learning a mean velocity field in point space with a tailored Diffusion Transformer and a new auxiliary loss.

  10. Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers

    cs.CV 2026-04 unverdicted novelty 6.0

    Sculpt4D generates temporally coherent 4D shapes by integrating a block sparse attention mechanism with time-decaying mask into a pretrained 3D diffusion transformer, achieving SOTA results with 56% less computation.

  11. SIC3D: Style Image Conditioned Text-to-3D Gaussian Splatting Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    SIC3D generates text-to-3D objects with Gaussian splatting then stylizes them using Variational Stylized Score Distillation loss plus scaling regularization to improve style match and geometry fidelity.

  12. TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models

    cs.CV 2025-02 unverdicted novelty 6.0

    TripoSG generates high-fidelity 3D meshes from input images via a large-scale rectified flow transformer and hybrid-trained 3D VAE on a custom 2-million-sample dataset, claiming state-of-the-art fidelity and generalization.

  13. InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

    cs.CV 2024-04 unverdicted novelty 6.0

    InstantMesh produces diverse, high-quality 3D meshes from single images in seconds by combining a multi-view diffusion model with a sparse-view large reconstruction model and optimizing directly on meshes.

  14. SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

    cs.CV 2023-09 unverdicted novelty 6.0

    SyncDreamer produces multiview-consistent images from a single input image by jointly modeling their distribution and synchronizing intermediate diffusion states via 3D-aware attention.

  15. MVDream: Multi-view Diffusion for 3D Generation

    cs.CV 2023-08 conditional novelty 6.0

    MVDream is a multi-view diffusion model that functions as a generalizable 3D prior, enabling more consistent text-to-3D generation and few-shot 3D concept learning from 2D examples.

  16. SpatialPrompt: XR-Based Spatial Intent Expression as Executable Constraints for AI Generative 3D Design

    cs.HC 2026-05 unverdicted novelty 5.0

    SpatialPrompt turns spatial sketches and voice prompts into executable constraints for controllable AI 3D generation in XR, enabling iterative collaborative creation with color-coded contributions.

  17. From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation

    cs.GR 2026-04 unverdicted novelty 5.0

    The paper surveys 3D asset generation methods and organizes them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.

  18. Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images

    cs.CV 2026-04 unverdicted novelty 5.0

    Unposed-to-3D learns simulation-ready 3D vehicle models from unposed real images by predicting camera parameters for photometric self-supervision, then adding scale prediction and harmonization.

  19. MOC-3D: Manifold-Order Consistency for Text-to-3D Generation

    cs.CV 2026-05 unverdicted novelty 4.0

    MOC-3D adds a semantic view-order constraint using CLIP monotonicity and a manifold-based feature continuity module on SPD Riemannian space to reduce macro-topological and micro-geometric inconsistencies in SDS-based ...

  20. From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation

    cs.GR 2026-04 unverdicted novelty 4.0

    The paper surveys 3D content generation literature using a taxonomy of asset types and production stages to evaluate progress toward engine-ready assets.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · cited by 19 Pith papers · 32 internal anchors

  1. [1]

    Learning Representations and Generative Models for 3D Point Clouds

    Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3d point clouds. arXiv:1707.02392, 2017

  2. [2]

    MusicLM: Generating Music From Text

    Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, and Christian Frank. Musiclm: Generating music from text, 2023. URL https://arxiv.org/abs/2301.11325

  3. [3]

    Sine: Semantic-driven image-based nerf editing with prior-guided editing field

    Chong Bao, Yinda Zhang, Bangbang Yang, Tianxing Fan, Zesong Yang, Hujun Bao, Guofeng Zhang, and Zhaopeng Cui. Sine: Semantic-driven image-based nerf editing with prior-guided editing field. arXiv:2303.13277, 2023

  4. [4]

    Gaudi: A neural architect for immersive 3d scene generation

    Miguel Angel Bautista, Pengsheng Guo, Samira Abnar, Walter Talbott, Alexander Toshev, Zhuoyuan Chen, Laurent Dinh, Shuangfei Zhai, Hanlin Goh, Daniel Ulbricht, Afshin Dehghan, and Josh Susskind. Gaudi: A neural architect for immersive 3d scene generation. arXiv:2207.13751, 2022

  5. [5]

    AudioLM: a language modeling approach to audio generation

    Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. AudioLM: a language modeling approach to audio generation, 2022. URL https://arxiv.org/abs/2209.03143

  6. [6]

    Transformers as meta-learners for implicit neural representations

    Yinbo Chen and Xiaolong Wang. Transformers as meta-learners for implicit neural representations, 2022. URL https://arxiv.org/abs/2208.02801

  7. [7]

    Perception prioritized training of diffusion models

    Jooyoung Choi, Jungbeom Lee, Chaehun Shin, Sungwon Kim, Hyunwoo Kim, and Sungroh Yoon. Perception prioritized training of diffusion models, 2022

  8. [8]

    Maximum likelihood from incomplete data via the EM algorithm

    A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977. ISSN 00359246. URL http://www.jstor.org/stable/2984875

  9. [9]

    Diffusion Models Beat GANs on Image Synthesis

    Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. arXiv:2105.05233, 2021

  10. [10]

    Jukebox: A Generative Model for Music

    Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music. arXiv:2005.00341, 2020

  11. [11]

    An efficient method of triangulating equi-valued surfaces by using tetrahedral cells

    Akio Doi and Akio Koide. An efficient method of triangulating equi-valued surfaces by using tetrahedral cells. IEICE Transactions on Information and Systems, 74:214–224, 1991

  12. [12]

    From data to functa: Your data point is a function and you can treat it like one

    Emilien Dupont, Hyunjik Kim, S. M. Ali Eslami, Danilo Rezende, and Dan Rosenbaum. From data to functa: Your data point is a function and you can treat it like one. arXiv:2201.12204, 2022

  13. [13]

    Neural spline flows

    Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. arXiv:1906.04032, 2019

  14. [14]

    HyperDiffusion: Generating implicit neural fields with weight-space diffusion

    Ziya Erkoç, Fangchang Ma, Qi Shan, Matthias Nießner, and Angela Dai. Hyperdiffusion: Generating implicit neural fields with weight-space diffusion, 2023

  15. [15]

    Ernie-vilg 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts

    Zhida Feng, Zhenyu Zhang, Xintong Yu, Yewei Fang, Lanxin Li, Xuyi Chen, Yuxiang Lu, Jiaxiang Liu, Weichong Yin, Shikun Feng, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. Ernie-vilg 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts. arXiv:2210.15257, 2022

  16. [16]

    Shapecrafter: A recursive text-conditioned 3d shape generation model

    Rao Fu, Xiao Zhan, Yiwen Chen, Daniel Ritchie, and Srinath Sridhar. Shapecrafter: A recursive text-conditioned 3d shape generation model. arXiv:2207.09446, 2022

  17. [17]

    Make-a-scene: Scene-based text-to-image generation with human priors

    Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. arXiv:2203.13131, 2022

  18. [18]

    Get3d: A generative model of high quality 3d textured shapes learned from images

    Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. arXiv:2209.11163, 2022

  19. [19]

    Generative Adversarial Networks

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv:1406.2661, 2014

  20. [20]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv:1606.08415, 2016

  21. [21]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. URL https://openreview.net/forum?id=qw8AKxfYbI

  22. [22]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv:2006.11239, 2020

  23. [23]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models. arXiv:2210.02303, 2022

  24. [24]

    Video Diffusion Models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. arXiv:2204.03458, 2022

  25. [25]

    Noise2Music: Text-conditioned music generation with diffusion models

    Qingqing Huang, Daniel S. Park, Tao Wang, Timo I. Denk, Andy Ly, Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank, Jesse Engel, Quoc V. Le, William Chan, Zhifeng Chen, and Wei Han. Noise2music: Text-conditioned music generation with diffusion models, 2023. URL https://arxiv.org/abs/2302.03917

  26. [26]

    Zero-shot text-guided object generation with dream fields

    Ajay Jain, Ben Mildenhall, Jonathan T. Barron, Pieter Abbeel, and Ben Poole. Zero-shot text-guided object generation with dream fields. arXiv:2112.01455, 2021

  27. [27]

    Elucidating the Design Space of Diffusion-Based Generative Models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. arXiv:2206.00364, 2022

  28. [28]

    Clip-mesh: Generating textured meshes from text using pretrained image-text models

    Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky, and Tiberiu Popa. Clip-mesh: Generating textured meshes from text using pretrained image-text models. arXiv:2203.13333, 2022

  29. [29]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014

  30. [30]

    NeRF-VAE: A geometry aware 3D scene generative model

    Adam R Kosiorek, Heiko Strathmann, Daniel Zoran, Pol Moreno, Rosalia Schneider, Soňa Mokrá, and Danilo J Rezende. NeRF-VAE: A geometry aware 3D scene generative model. arXiv:2104.00587, April 2021

  31. [31]

    AudioGen: Textually guided audio generation

    Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi. AudioGen: Textually guided audio generation, 2022. URL https://arxiv.org/abs/2209.15352

  33. [33]

    Modular primitives for high-performance differentiable rendering

    Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. Modular primitives for high-performance differentiable rendering. arXiv:2011.03277, 2020

  34. [34]

    Magic3d: High-resolution text-to-3d content creation

    Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. arXiv:2211.10440, 2022

  35. [35]

    Towards implicit text-guided 3d shape generation

    Zhengzhe Liu, Yi Wang, Xiaojuan Qi, and Chi-Wing Fu. Towards implicit text-guided 3d shape generation. arXiv:2203.14622, 2022

  36. [36]

    Marching cubes: A high resolution 3D surface construction algorithm

    William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. In Maureen C. Stone, editor, SIGGRAPH, pages 163–169. ACM, 1987. ISBN 0-89791-227-6. URL http://dblp.uni-trier.de/db/conf/siggraph/siggraph1987.html#LorensenC87

  38. [38]

    Diffusion probabilistic models for 3d point cloud generation

    Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. arXiv:2103.01458, 2021

  39. [39]

    Mixed Precision Training

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. arXiv:1710.03740, 2017

  40. [40]

    NeRF: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. arXiv:2003.08934, 2020

  41. [41]

    Improved Denoising Diffusion Probabilistic Models

    Alex Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. arXiv:2102.09672, 2021

  42. [42]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv:2112.10741, 2021

  43. [43]

    Point-E: A System for Generating 3D Point Clouds from Complex Prompts

    Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv:2212.08751, 2022

  44. [44]

    Benchmark for compositional text-to-image synthesis

    Dong Huk Park, Samaneh Azadi, Xihui Liu, Trevor Darrell, and Anna Rohrbach. Benchmark for compositional text-to-image synthesis. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. URL https://openreview.net/forum?id=bKBhQhPeKaF

  45. [45]

    DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation

    Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. arXiv:1901.05103, 2019

  46. [46]

    MuseNet

    Christine Payne. Musenet. OpenAI blog, 2019. URL https://openai.com/blog/musenet

  47. [47]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv:2209.14988, 2022

  48. [48]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. arXiv:2103.00020, 2021

  49. [49]

    Zero-Shot Text-to-Image Generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. arXiv:2102.12092, 2021

  50. [50]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv:2204.06125, 2022

  51. [51]

    Accelerating 3D Deep Learning with PyTorch3D

    Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501, 2020

  52. [52]

    Generating Diverse High-Fidelity Images with VQ-VAE-2

    Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. arXiv:1906.00446, 2019

  53. [53]

    Variational Inference with Normalizing Flows

    Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv:1505.05770, 2015

  54. [54]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. arXiv:2112.10752, 2021

  55. [55]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv:2205.11487, 2022

  56. [56]

    CLIP-Forge: Towards zero-shot text-to-shape generation

    Aditya Sanghi, Hang Chu, Joseph G. Lambourne, Ye Wang, Chin-Yi Cheng, Marco Fumero, and Kamal Rahimi Malekshan. Clip-forge: Towards zero-shot text-to-shape generation. arXiv:2110.02624, 2021

  57. [57]

    Textcraft: Zero-shot generation of high-fidelity and diverse shapes from text

    Aditya Sanghi, Rao Fu, Vivian Liu, Karl Willis, Hooman Shayani, Amir Hosein Khasahmadi, Srinath Sridhar, and Daniel Ritchie. Textcraft: Zero-shot generation of high-fidelity and diverse shapes from text. arXiv:2211.01427, 2022

  58. [58]

    Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis

    Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis. arXiv:2111.04276, 2021

  59. [59]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data. arXiv:2209.14792, 2022

  60. [60]

    Deep Unsupervised Learning using Nonequilibrium Thermodynamics

    Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv:1503.03585, 2015

  61. [61]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, 2020

  62. [62]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. arXiv:1907.05600, 2020

  63. [63]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv:2011.13456, 2020

  64. [64]

    Marching cubes 33: Construction of topologically correct isosurfaces

    Evgueni Tcherniaev. Marching cubes 33: Construction of topologically correct isosurfaces. 1996

  65. [65]

    Neural Discrete Representation Learning

    Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. arXiv:1711.00937, 2017

  66. [66]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv:1706.03762, 2017

  67. [67]

    Score Jacobian Chaining: Lifting pretrained 2D diffusion models for 3D generation

    Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. arXiv:2212.00774, 2022

  68. [68]

    Rodin: A generative model for sculpting 3D digital avatars using diffusion

    Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, and Baining Guo. Rodin: A generative model for sculpting 3d digital avatars using diffusion, 2022. URL https://arxiv.org/abs/2212.06135

  69. [69]

    Novel view synthesis with diffusion models

    Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. arXiv:2210.04628, 2022

  70. [70]

    The emergence of deepfake technology: A review

    Mika Westerlund. The emergence of deepfake technology: A review. Technology Innovation Management Review, 9:40–53, November 2019. ISSN 1927-0321. doi: 10.22215/timreview/1282. URL timreview.ca/article/1282

  71. [71]

    Pointflow: 3d point cloud generation with continuous normalizing flows

    Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. Pointflow: 3d point cloud generation with continuous normalizing flows. arXiv:1906.12320, 2019

  72. [72]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation. arXiv:2206.10789, 2022

  73. [73]

    Lion: Latent point diffusion models for 3d shape generation

    Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. Lion: Latent point diffusion models for 3d shape generation. arXiv:2210.06978, 2022

  74. [74]

    ARF: Artistic radiance fields

    Kai Zhang, Nick Kolkin, Sai Bi, Fujun Luan, Zexiang Xu, Eli Shechtman, and Noah Snavely. Arf: Artistic radiance fields. arXiv:2206.06360, 2022