pith. machine review for the scientific record.

arxiv: 2307.05663 · v1 · submitted 2023-07-11 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link

Objaverse-XL: A Universe of 10M+ 3D Objects

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 12:56 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords 3D dataset · Objaverse-XL · novel view synthesis · zero-shot generalization · 3D vision · multi-view rendering · large-scale data

The pith

Objaverse-XL supplies over 10 million 3D objects that let models like Zero123 reach strong zero-shot generalization on novel view synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Objaverse-XL, a dataset of more than 10 million deduplicated 3D objects gathered from manual designs, photogrammetry scans, and professional artifact scans. It notes that 3D vision has lagged behind 2D and language models mainly because of limited high-quality training data. The authors render over 100 million multi-view images from the dataset and train Zero123 for novel view synthesis. This produces strong zero-shot generalization on the task. The work releases the dataset to support further scaling experiments in 3D vision.
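
To ground the rendering step, here is a minimal sketch of how multi-view training views are commonly posed in Zero123-style pipelines: normalize each object to the origin, then sample cameras on a surrounding sphere, each looking inward. This is an illustrative assumption, not the authors' released rendering code; the function name, radius, and view count are hypothetical choices.

    # Minimal sketch (assumed pipeline, not the paper's released code):
    # sample camera-to-world poses on a sphere around an origin-centered object.
    import numpy as np

    def sample_camera_poses(n_views: int, radius: float = 2.0, seed: int = 0) -> np.ndarray:
        """Return (n_views, 4, 4) camera-to-world matrices looking at the origin."""
        rng = np.random.default_rng(seed)
        poses = []
        for _ in range(n_views):
            d = rng.normal(size=3)
            d /= np.linalg.norm(d)            # uniform direction on the sphere
            eye = radius * d                  # camera position
            forward = -d                      # view direction: toward the origin
            up = np.array([0.0, 0.0, 1.0])
            right = np.cross(forward, up)
            if np.linalg.norm(right) < 1e-6:  # camera on the z-axis: pick any right
                right = np.array([1.0, 0.0, 0.0])
            right /= np.linalg.norm(right)
            true_up = np.cross(right, forward)
            c2w = np.eye(4)
            c2w[:3, 0], c2w[:3, 1] = right, true_up
            c2w[:3, 2], c2w[:3, 3] = -forward, eye  # OpenGL-style: camera looks down -z
            poses.append(c2w)
        return np.stack(poses)

    poses = sample_camera_poses(n_views=12)
    print(poses.shape)  # (12, 4, 4)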

Core claim

Objaverse-XL comprises over 10 million deduplicated 3D objects drawn from manually designed models, photogrammetry scans of landmarks and everyday items, and professional scans of historic artifacts. Training Zero123 on novel view synthesis with more than 100 million multi-view rendered images from this collection yields strong zero-shot generalization abilities.

What carries the argument

Objaverse-XL dataset of over 10 million diverse 3D objects, which supplies the volume and variety needed to generate 100 million-plus multi-view training images for 3D models.

If this is right

  • Zero123 exhibits stronger zero-shot performance on novel view synthesis when trained at the scale enabled by Objaverse-XL.
  • 3D vision models gain access to training volumes previously unavailable, mirroring data-scaling benefits seen in 2D vision.
  • Diversity across object sources supports generalization to varied object types and scanning styles.
  • The dataset supports additional large-scale experiments that were previously limited by data availability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same scale of rendered multi-view data could be tested on other 3D tasks such as reconstruction or conditional generation.
  • Pretrained models derived from this data volume may transfer to downstream applications like robotics or AR that require 3D understanding.
  • Future work could measure how much further performance improves when the dataset size grows beyond the current 10 million objects.

Load-bearing premise

Observed gains in zero-shot performance stem primarily from the scale and diversity of the 3D objects rather than from other training choices or evaluation details that were not isolated.

What would settle it

Retraining Zero123 with the same procedure on a much smaller subset of the objects, and finding that zero-shot generalization remains comparable, would falsify the central claim. A sketch of such a test follows.
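
A minimal sketch of that test, under stated assumptions: train on nested random subsets with an otherwise identical pipeline and compare one fixed zero-shot benchmark across sizes. train_zero123 and evaluate_nvs are hypothetical stand-ins, not functions from the paper's released code.

    # Sketch of the data-scaling ablation described above. `train_zero123`
    # and `evaluate_nvs` are hypothetical callables, not the paper's API.
    import random

    def scaling_sweep(object_ids, subset_sizes, train_zero123, evaluate_nvs, seed=0):
        """Train on nested subsets with an identical pipeline; return size -> metric."""
        rng = random.Random(seed)
        shuffled = rng.sample(list(object_ids), len(object_ids))
        results = {}
        for size in sorted(subset_sizes):
            subset = shuffled[:size]             # nested: each subset contains the smaller ones
            model = train_zero123(subset)        # renders, optimizer, schedule all held fixed
            results[size] = evaluate_nvs(model)  # one fixed held-out zero-shot benchmark
        return results

    # A flat curve (small subsets matching the full set) would falsify the
    # scale claim; a monotone rise would support it.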

Original abstract

Natural language processing and 2D vision models have attained remarkable proficiency on many tasks primarily by escalating the scale of training data. However, 3D vision tasks have not seen the same progress, in part due to the challenges of acquiring high-quality 3D data. In this work, we present Objaverse-XL, a dataset of over 10 million 3D objects. Our dataset comprises deduplicated 3D objects from a diverse set of sources, including manually designed objects, photogrammetry scans of landmarks and everyday items, and professional scans of historic and antique artifacts. Representing the largest scale and diversity in the realm of 3D datasets, Objaverse-XL enables significant new possibilities for 3D vision. Our experiments demonstrate the improvements enabled with the scale provided by Objaverse-XL. We show that by training Zero123 on novel view synthesis, utilizing over 100 million multi-view rendered images, we achieve strong zero-shot generalization abilities. We hope that releasing Objaverse-XL will enable further innovations in the field of 3D vision at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Objaverse-XL, a dataset of over 10 million deduplicated 3D objects drawn from manual designs, photogrammetry scans of landmarks and everyday items, and professional scans of historic artifacts. It positions this as the largest and most diverse 3D dataset to date. The central empirical demonstration renders over 100 million multi-view images from these objects and trains Zero123 on novel view synthesis, reporting strong zero-shot generalization.

Significance. If the reported gains are attributable to dataset scale, Objaverse-XL supplies a valuable public resource that could enable the same kind of scaling progress in 3D vision that large corpora have produced in NLP and 2D vision. The open release of 10M+ objects together with the 100M+ rendered views is a concrete community asset; the authors deserve credit for the curation effort and for making the data available.

major comments (2)
  1. [Experiments] Experiments section: the claim that 'improvements enabled with the scale provided by Objaverse-XL' are demonstrated by training Zero123 on >100M renders is not supported by a controlled comparison. No ablation is described that holds the rendering pipeline (camera sampling, lighting, resolution), optimizer, and evaluation protocol fixed while varying only the source dataset or its size (e.g., original Objaverse vs. Objaverse-XL subsets). Consequently the zero-shot gains cannot be isolated from unablated training or rendering choices.
  2. [Experiments] Experiments section: quantitative results for the Zero123 zero-shot novel-view-synthesis task lack reported baselines with exact matching settings, ablation controls, and statistical significance measures (e.g., standard errors or multiple runs). This weakens the strength of the scaling demonstration.
minor comments (2)
  1. [Abstract] Abstract: the quantitative improvements (e.g., specific metrics on zero-shot NVS) are not stated; adding one or two headline numbers would strengthen the summary.
  2. [Dataset] Dataset section: the deduplication procedure and any quantitative measure of diversity (e.g., category coverage or geometric variation statistics) should be described more explicitly.
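
On the deduplication point: the page does not spell out the procedure, so the sketch below shows one plausible baseline, hashing canonicalized geometry (centered, scaled, quantized, vertex-order sorted) so exact copies and trivial rescales collide. This is an assumed approach for illustration, not the authors' method; geometry_fingerprint and the grid size are hypothetical.

    # Illustrative near-duplicate filter (an assumption, not the paper's
    # procedure): hash canonicalized vertices so copies and rescales collide.
    import hashlib
    import numpy as np

    def geometry_fingerprint(vertices: np.ndarray, grid: int = 128) -> str:
        """Hash of centered, scaled, quantized, order-canonicalized vertices."""
        v = vertices - vertices.mean(axis=0)          # translation-invariant
        scale = np.abs(v).max() or 1.0                # guard all-zero geometry
        q = np.round((v / scale + 1.0) / 2.0 * (grid - 1)).astype(np.int32)
        q = q[np.lexsort(q.T[::-1])]                  # vertex-order invariant
        return hashlib.sha256(q.tobytes()).hexdigest()

    def deduplicate(meshes: dict) -> list:
        """Keep one id per fingerprint; `meshes` maps id -> (N, 3) vertex array."""
        seen, kept = set(), []
        for mesh_id, verts in meshes.items():
            fp = geometry_fingerprint(verts)
            if fp not in seen:
                seen.add(fp)
                kept.append(mesh_id)
        return kept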

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation of our experimental results.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the claim that 'improvements enabled with the scale provided by Objaverse-XL' are demonstrated by training Zero123 on >100M renders is not supported by a controlled comparison. No ablation is described that holds the rendering pipeline (camera sampling, lighting, resolution), optimizer, and evaluation protocol fixed while varying only the source dataset or its size (e.g., original Objaverse vs. Objaverse-XL subsets). Consequently the zero-shot gains cannot be isolated from unablated training or rendering choices.

    Authors: We appreciate the referee highlighting the value of a controlled ablation. The manuscript's experiments focus on demonstrating what becomes possible at the scale of Objaverse-XL by training Zero123 on over 100 million rendered views, achieving strong zero-shot generalization. While we did not include an explicit ablation that retrains with identical rendering, optimization, and evaluation settings on the original Objaverse versus Objaverse-XL subsets, the primary contribution is the public release of this much larger and more diverse dataset. In the revised manuscript we will add a dedicated paragraph in the experiments section that (1) explicitly compares the data scale and object diversity to the original Objaverse used in prior Zero123 work and (2) clarifies that all rendering parameters (camera sampling, lighting, resolution) are fully documented so that future controlled studies can be performed. We believe this addresses the isolation concern without overstating the current evidence. revision: yes

  2. Referee: [Experiments] Experiments section: quantitative results for the Zero123 zero-shot novel-view-synthesis task lack reported baselines with exact matching settings, ablation controls, and statistical significance measures (e.g., standard errors or multiple runs). This weakens the strength of the scaling demonstration.

    Authors: We agree that clearer reporting of baselines and any available measures of variability would improve the manuscript. The current results are presented as the performance obtained when training on the full Objaverse-XL scale. In the revision we will expand the experimental details to include side-by-side numerical comparison with the original Zero123 numbers, explicitly noting any differences in training settings or data volume. Because retraining the full 100-million-image model multiple times is computationally prohibitive, we will add an explicit limitations paragraph stating that results are from single runs and that statistical significance testing was not performed; we will also report any variance observed in smaller-scale pilot experiments if they exist. These changes will make the strength and limitations of the scaling demonstration more transparent. revision: yes
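
A small sketch of the variability reporting proposed in this response: mean and standard error of a metric over independent seeds, as might be done for pilot runs. The values below are placeholders, not results from the paper.

    # Mean ± standard error over independent training seeds (sketch; the
    # PSNR values below are placeholders, not numbers from the paper).
    import numpy as np

    def mean_and_stderr(values):
        a = np.asarray(values, dtype=float)
        return a.mean(), a.std(ddof=1) / np.sqrt(len(a))

    psnr_runs = [18.2, 18.5, 18.1]  # hypothetical pilot-run metrics
    m, se = mean_and_stderr(psnr_runs)
    print(f"PSNR = {m:.2f} ± {se:.2f} (n={len(psnr_runs)})")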

Circularity Check

0 steps flagged

No circularity: empirical dataset release and scaling observation

Full rationale

The paper releases Objaverse-XL (10M+ 3D objects from diverse sources) and reports that training Zero123 on >100M multi-view renders from it yields strong zero-shot novel view synthesis. No derivation chain, equations, or 'predictions' are claimed. The central claim is an empirical scaling result, not a reduction of any output to fitted inputs or self-citations by construction. No self-definitional steps, uniqueness theorems, or ansatzes appear. The contribution is self-contained as a data release plus observed performance gains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the assumption that aggregating and deduplicating existing 3D assets produces a high-quality training distribution; no new physical axioms or invented entities are introduced.

axioms (1)
  • domain assumption: Deduplication across heterogeneous 3D sources removes near-duplicates without discarding useful diversity.
    Invoked in the dataset construction section to justify the final count and quality.

pith-pipeline@v0.9.0 · 5565 in / 1179 out tokens · 61852 ms · 2026-05-17T12:56:59.647505+00:00 · methodology


Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mix3R: Mixing Feed-forward Reconstruction and Generative 3D Priors for Joint Multi-view Aligned 3D Reconstruction and Pose Estimation

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    Mix3R mixes feed-forward reconstruction and generative 3D priors via Mixture-of-Transformers and overlap-based attention bias to achieve better-aligned 3D shapes and more accurate poses than either approach alone.

  2. Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.

  3. MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

    cs.CV · 2025-12 · unverdicted · novelty 7.0

    MoCapAnything reconstructs asset-specific BVH animations from monocular video by predicting 3D joint trajectories then applying constraint-aware inverse kinematics guided by a reference prompt encoder.

  4. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

    cs.CV · 2023-09 · unverdicted · novelty 7.0

    DreamGaussian creates high-quality textured 3D meshes from single-view images in 2 minutes via generative Gaussian Splatting with mesh extraction and UV refinement.

  5. Velox: Learning Representations of 4D Geometry and Appearance

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...

  6. 3D-ReGen: A Unified 3D Geometry Regeneration Framework

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.

  7. Reconstruction by Generation: 3D Multi-Object Scene Reconstruction from Sparse Observations

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    RecGen achieves state-of-the-art 3D multi-object scene reconstruction from sparse RGB-D views by combining compositional synthetic scene generation with strong 3D shape priors, outperforming SAM3D by 30%+ in shape quality.

  8. PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.

  9. CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

    cs.CV · 2024-06 · unverdicted · novelty 6.0

    CamCo equips image-to-video generators with Plücker-coordinate camera inputs and epipolar attention to improve 3D consistency and camera controllability.

  10. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    cs.RO · 2024-03 · accept · novelty 6.0

    DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.

  11. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    cs.CV · 2023-11 · conditional · novelty 6.0

    Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results.

  12. SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

    cs.CV · 2023-09 · unverdicted · novelty 6.0

    SyncDreamer produces multiview-consistent images from a single input image by jointly modeling their distribution and synchronizing intermediate diffusion states via 3D-aware attention.

  13. MVDream: Multi-view Diffusion for 3D Generation

    cs.CV · 2023-08 · conditional · novelty 6.0

    MVDream is a multi-view diffusion model that functions as a generalizable 3D prior, enabling more consistent text-to-3D generation and few-shot 3D concept learning from 2D examples.

  14. Syn4D: A Multiview Synthetic 4D Dataset

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.

  15. Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model

    cs.CV · 2023-10 · unverdicted · novelty 5.0

    Zero123++ produces high-quality 3D-consistent multi-view images from a single input by fine-tuning Stable Diffusion with targeted conditioning and training methods.

  16. LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

    cs.CV · 2026-04 · unverdicted · novelty 3.0

    This review organizes literature on large multimodal models and object-centric vision into four themes (understanding, referring segmentation, editing, and generation) while summarizing paradigms, strategies, and challenges.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · cited by 16 Pith papers · 14 internal anchors

  1. [1] URL https://commoncrawl.org/the-data/.
  2. [2] H. Aanæs, R. R. Jensen, G. Vogiatzis, E. Tola, and A. B. Dahl. Large-scale data for multiple-view stereopsis. International Journal of Computer Vision, pages 1–16, 2016.
  3. [3] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  4. [4] L. Biewald. Experiment tracking with Weights and Biases, 2020. URL https://www.wandb.com/. Software available from wandb.com.
  5. [5] Blender Online Community. Blender - a 3D modelling and rendering package. https://www.blender.org, 2023.
  6. [6] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-driven documents. IEEE Transactions on Visualization and Computer Graphics, 2011.
  7. [7] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  8. [8] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I, pages 213–229. Springer, 2020.
  9. [9] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
  10. [10] W. Chen, H. Ling, J. Gao, E. Smith, J. Lehtinen, A. Jacobson, and S. Fidler. Learning to predict 3D objects with an interpolation-based differentiable renderer. Advances in Neural Information Processing Systems, 32, 2019.
  11. [11] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar. Masked-attention mask transformer for universal image segmentation. 2022.
  12. [12] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, pages 628–644. Springer, 2016.
  13. [13] J. Collins, S. Goel, K. Deng, A. Luthra, L. Xu, E. Gundogdu, X. Zhang, T. F. Y. Vicente, T. Dideriksen, H. Arora, et al. ABO: Dataset and benchmarks for real-world 3D object understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21126–21136, 2022.
  14. [14] M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. Objaverse: A universe of annotated 3D objects. arXiv preprint arXiv:2212.08051, 2022.
  15. [15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  16. [16] K. Deng, A. Liu, J.-Y. Zhu, and D. Ramanan. Depth-supervised NeRF: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12882–12891, 2022.
  17. [17] L. Downs, A. Francis, N. Koenig, B. Kinman, R. Hickman, K. Reymann, T. B. McHugh, and V. Vanhoucke. Google Scanned Objects: A high-quality dataset of 3D scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022.
  18. [18] W. Falcon and The PyTorch Lightning team. PyTorch Lightning, Mar. 2019. URL https://github.com/Lightning-AI/lightning.
  19. [19] H. Fu, R. Jia, L. Gao, M. Gong, B. Zhao, S. Maybank, and D. Tao. 3D-FUTURE: 3D furniture shape with texture. International Journal of Computer Vision, 129:3313–3337, 2021.
  20. [20] S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, et al. DataComp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023.
  21. [21] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.
  22. [22] G. Gkioxari, J. Malik, and J. Johnson. Mesh R-CNN. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9785–9795, 2019.
  23. [23] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant. Array programming with NumPy. Nature, 585:357–362, 2020.
  24. [24] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
  25. [25] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  26. [26] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi: 10.1109/MCSE.2007.55.
  27. [27] A. Jain, M. Tancik, and P. Abbeel. Putting NeRF on a diet: Semantically consistent few-shot view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5885–5894, 2021.
  28. [28] H. Jun and A. Nichol. Shap-E: Generating conditional 3D implicit functions. arXiv preprint arXiv:2305.02463, 2023.
  29. [29] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  30. [30] H. Kato, Y. Ushiku, and T. Harada. Neural 3D mesh renderer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3907–3916, 2018.
  31. [31] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment Anything. arXiv preprint arXiv:2304.02643, 2023.
  32. [32] J. J. Lim, H. Pirsiavash, and A. Torralba. Parsing IKEA objects: Fine pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2992–2999, 2013.
  33. [33] C.-H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S. Fidler, M.-Y. Liu, and T.-Y. Lin. Magic3D: High-resolution text-to-3D content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023.
  34. [34] R. Liu and C. Vondrick. Humans as light bulbs: 3D human reconstruction from thermal reflection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12531–12542, 2023.
  35. [35] R. Liu, S. Menon, C. Mao, D. Park, S. Stent, and C. Vondrick. Shadows shed light on 3D objects. arXiv preprint arXiv:2206.08990, 2022.
  36. [36] R. Liu, R. Wu, B. V. Hoorick, P. Tokmakov, S. Zakharov, and C. Vondrick. Zero-1-to-3: Zero-shot one image to 3D object, 2023.
  37. [37] J. Lu, C. Clark, R. Zellers, R. Mottaghi, and A. Kembhavi. Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916, 2022.
  38. [38] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger. Occupancy networks: Learning 3D reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4460–4470, 2019.
  39. [39] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  40. [40] G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
  41. [41] D. Morrison, P. Corke, and J. Leitner. EGAD! An evolved grasping analysis dataset for diversity and reproducibility in robotic manipulation. IEEE Robotics and Automation Letters, 5(3):4368–4375, 2020.
  42. [42] A. Nichol, H. Jun, P. Dhariwal, P. Mishkin, and M. Chen. Point-E: A system for generating 3D point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
  43. [43] OpenAI. GPT-4 technical report. arXiv, 2023.
  44. [44] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  45. [45] The pandas development team. pandas-dev/pandas: Pandas, Feb. 2020. URL https://doi.org/10.5281/zenodo.3509134.
  46. [46] K. Park, K. Rematas, A. Farhadi, and S. M. Seitz. PhotoShape: Photorealistic materials for large-scale shape collections. arXiv preprint arXiv:1809.09761, 2018.
  47. [47] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
  48. [48] B. Poole, A. Jain, J. T. Barron, and B. Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
  49. [49] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  50. [50] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  51. [51] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
  52. [52] N. Ravi, J. Reizenstein, D. Novotny, T. Gordon, W.-Y. Lo, J. Johnson, and G. Gkioxari. Accelerating 3D deep learning with PyTorch3D. arXiv preprint arXiv:2007.08501, 2020.
  53. [53] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 2015.
  54. [54] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  55. [55] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
  56. [56] J. Tang. Stable-DreamFusion: Text-to-3D with Stable Diffusion, 2022. https://github.com/ashawkey/stable-dreamfusion.
  57. [57] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  58. [58] H. Wang, X. Du, J. Li, R. A. Yeh, and G. Shakhnarovich. Score Jacobian chaining: Lifting pretrained 2D diffusion models for 3D generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12619–12629, 2023.
  59. [59] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 52–67, 2018.
  60. [60] Q. Wang, Z. Wang, K. Genova, P. P. Srinivasan, H. Zhou, J. T. Barron, R. Martin-Brualla, N. Snavely, and T. Funkhouser. IBRNet: Learning multi-view image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2021.
  61. [61] M. L. Waskom. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021, 2021. doi: 10.21105/joss.03021. URL https://doi.org/10.21105/joss.03021.
  62. [62] C.-Y. Wu, J. Johnson, J. Malik, C. Feichtenhofer, and G. Gkioxari. Multiview compressive coding for 3D reconstruction. arXiv preprint arXiv:2301.08247, 2023.
  63. [63] T. Wu, J. Zhang, X. Fu, Y. Wang, J. Ren, L. Pan, W. Wu, L. Yang, J. Wang, C. Qian, D. Lin, and Z. Liu. OmniObject3D: Large-vocabulary 3D object dataset for realistic perception, reconstruction and generation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  64. [64] A. Yu, V. Ye, M. Tancik, and A. Kanazawa. pixelNeRF: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021.
  65. [65] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
  66. [66] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, et al. LIMA: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023.
  67. [67] Q. Zhou and A. Jacobson. Thingi10K: A dataset of 10,000 3D-printing models. arXiv preprint arXiv:1605.04797, 2016.