pith. machine review for the scientific record.

arxiv: 2307.05663 · v1 · submitted 2023-07-11 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link

Objaverse-XL: A Universe of 10M+ 3D Objects

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 12:56 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords 3D dataset · Objaverse-XL · novel view synthesis · zero-shot generalization · 3D vision · multi-view rendering · large-scale data

The pith

Objaverse-XL supplies over 10 million 3D objects that let models like Zero123 reach strong zero-shot generalization on novel view synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Objaverse-XL, a dataset of more than 10 million deduplicated 3D objects gathered from manual designs, photogrammetry scans, and professional artifact scans. It notes that 3D vision has lagged behind 2D and language models mainly because of limited high-quality training data. The authors render over 100 million multi-view images from the dataset and train Zero123 for novel view synthesis. This produces strong zero-shot generalization on the task. The work releases the dataset to support further scaling experiments in 3D vision.
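
To ground the rendering step, here is a minimal sketch of how multi-view training views are commonly posed in Zero123-style pipelines: normalize each object to the origin, then sample cameras on a surrounding sphere, each looking inward. This is an illustrative assumption, not the authors' released rendering code; the function name, radius, and view count are hypothetical choices.

    # Minimal sketch (assumed pipeline, not the paper's released code):
    # sample camera-to-world poses on a sphere around an origin-centered object.
    import numpy as np

    def sample_camera_poses(n_views: int, radius: float = 2.0, seed: int = 0) -> np.ndarray:
        """Return (n_views, 4, 4) camera-to-world matrices looking at the origin."""
        rng = np.random.default_rng(seed)
        poses = []
        for _ in range(n_views):
            d = rng.normal(size=3)
            d /= np.linalg.norm(d)            # uniform direction on the sphere
            eye = radius * d                  # camera position
            forward = -d                      # view direction: toward the origin
            up = np.array([0.0, 0.0, 1.0])
            right = np.cross(forward, up)
            if np.linalg.norm(right) < 1e-6:  # camera on the z-axis: pick any right
                right = np.array([1.0, 0.0, 0.0])
            right /= np.linalg.norm(right)
            true_up = np.cross(right, forward)
            c2w = np.eye(4)
            c2w[:3, 0], c2w[:3, 1] = right, true_up
            c2w[:3, 2], c2w[:3, 3] = -forward, eye  # OpenGL-style: camera looks down -z
            poses.append(c2w)
        return np.stack(poses)

    poses = sample_camera_poses(n_views=12)
    print(poses.shape)  # (12, 4, 4)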

Core claim

Objaverse-XL comprises over 10 million deduplicated 3D objects drawn from manually designed models, photogrammetry scans of landmarks and everyday items, and professional scans of historic artifacts. Training Zero123 on novel view synthesis with more than 100 million multi-view rendered images from this collection yields strong zero-shot generalization abilities.

What carries the argument

Objaverse-XL dataset of over 10 million diverse 3D objects, which supplies the volume and variety needed to generate 100 million-plus multi-view training images for 3D models.

If this is right

  • Zero123 exhibits stronger zero-shot performance on novel view synthesis when trained at the scale enabled by Objaverse-XL.
  • 3D vision models gain access to training volumes previously unavailable, mirroring data-scaling benefits seen in 2D vision.
  • Diversity across object sources supports generalization to varied object types and scanning styles.
  • The dataset supports additional large-scale experiments that were previously limited by data availability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same scale of rendered multi-view data could be tested on other 3D tasks such as reconstruction or conditional generation.
  • Pretrained models derived from this data volume may transfer to downstream applications like robotics or AR that require 3D understanding.
  • Future work could measure how much further performance improves when the dataset size grows beyond the current 10 million objects.

Load-bearing premise

Observed gains in zero-shot performance stem primarily from the scale and diversity of the 3D objects rather than from other training choices or evaluation details that were not isolated.

What would settle it

Retraining Zero123 with the same procedure on a much smaller subset of the objects, and finding that zero-shot generalization remains comparable, would falsify the central claim. A sketch of such a test follows.
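
A minimal sketch of that test, under stated assumptions: train on nested random subsets with an otherwise identical pipeline and compare one fixed zero-shot benchmark across sizes. train_zero123 and evaluate_nvs are hypothetical stand-ins, not functions from the paper's released code.

    # Sketch of the data-scaling ablation described above. `train_zero123`
    # and `evaluate_nvs` are hypothetical callables, not the paper's API.
    import random

    def scaling_sweep(object_ids, subset_sizes, train_zero123, evaluate_nvs, seed=0):
        """Train on nested subsets with an identical pipeline; return size -> metric."""
        rng = random.Random(seed)
        shuffled = rng.sample(list(object_ids), len(object_ids))
        results = {}
        for size in sorted(subset_sizes):
            subset = shuffled[:size]             # nested: each subset contains the smaller ones
            model = train_zero123(subset)        # renders, optimizer, schedule all held fixed
            results[size] = evaluate_nvs(model)  # one fixed held-out zero-shot benchmark
        return results

    # A flat curve (small subsets matching the full set) would falsify the
    # scale claim; a monotone rise would support it.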

Original abstract

Natural language processing and 2D vision models have attained remarkable proficiency on many tasks primarily by escalating the scale of training data. However, 3D vision tasks have not seen the same progress, in part due to the challenges of acquiring high-quality 3D data. In this work, we present Objaverse-XL, a dataset of over 10 million 3D objects. Our dataset comprises deduplicated 3D objects from a diverse set of sources, including manually designed objects, photogrammetry scans of landmarks and everyday items, and professional scans of historic and antique artifacts. Representing the largest scale and diversity in the realm of 3D datasets, Objaverse-XL enables significant new possibilities for 3D vision. Our experiments demonstrate the improvements enabled with the scale provided by Objaverse-XL. We show that by training Zero123 on novel view synthesis, utilizing over 100 million multi-view rendered images, we achieve strong zero-shot generalization abilities. We hope that releasing Objaverse-XL will enable further innovations in the field of 3D vision at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Objaverse-XL, a dataset of over 10 million deduplicated 3D objects drawn from manual designs, photogrammetry scans of landmarks and everyday items, and professional scans of historic artifacts. It positions this as the largest and most diverse 3D dataset to date. The central empirical demonstration renders over 100 million multi-view images from these objects and trains Zero123 on novel view synthesis, reporting strong zero-shot generalization.

Significance. If the reported gains are attributable to dataset scale, Objaverse-XL supplies a valuable public resource that could enable the same kind of scaling progress in 3D vision that large corpora have produced in NLP and 2D vision. The open release of 10M+ objects together with the 100M+ rendered views is a concrete community asset; the authors deserve credit for the curation effort and for making the data available.

major comments (2)
  1. [Experiments] Experiments section: the claim that 'improvements enabled with the scale provided by Objaverse-XL' are demonstrated by training Zero123 on >100M renders is not supported by a controlled comparison. No ablation is described that holds the rendering pipeline (camera sampling, lighting, resolution), optimizer, and evaluation protocol fixed while varying only the source dataset or its size (e.g., original Objaverse vs. Objaverse-XL subsets). Consequently the zero-shot gains cannot be isolated from unablated training or rendering choices.
  2. [Experiments] Experiments section: quantitative results for the Zero123 zero-shot novel-view-synthesis task lack reported baselines with exact matching settings, ablation controls, and statistical significance measures (e.g., standard errors or multiple runs). This weakens the strength of the scaling demonstration.
minor comments (2)
  1. [Abstract] Abstract: the quantitative improvements (e.g., specific metrics on zero-shot NVS) are not stated; adding one or two headline numbers would strengthen the summary.
  2. [Dataset] Dataset section: the deduplication procedure and any quantitative measure of diversity (e.g., category coverage or geometric variation statistics) should be described more explicitly.
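
On the deduplication point: the page does not spell out the procedure, so the sketch below shows one plausible baseline, hashing canonicalized geometry (centered, scaled, quantized, vertex-order sorted) so exact copies and trivial rescales collide. This is an assumed approach for illustration, not the authors' method; geometry_fingerprint and the grid size are hypothetical.

    # Illustrative near-duplicate filter (an assumption, not the paper's
    # procedure): hash canonicalized vertices so copies and rescales collide.
    import hashlib
    import numpy as np

    def geometry_fingerprint(vertices: np.ndarray, grid: int = 128) -> str:
        """Hash of centered, scaled, quantized, order-canonicalized vertices."""
        v = vertices - vertices.mean(axis=0)          # translation-invariant
        scale = np.abs(v).max() or 1.0                # guard all-zero geometry
        q = np.round((v / scale + 1.0) / 2.0 * (grid - 1)).astype(np.int32)
        q = q[np.lexsort(q.T[::-1])]                  # vertex-order invariant
        return hashlib.sha256(q.tobytes()).hexdigest()

    def deduplicate(meshes: dict) -> list:
        """Keep one id per fingerprint; `meshes` maps id -> (N, 3) vertex array."""
        seen, kept = set(), []
        for mesh_id, verts in meshes.items():
            fp = geometry_fingerprint(verts)
            if fp not in seen:
                seen.add(fp)
                kept.append(mesh_id)
        return kept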

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation of our experimental results.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the claim that 'improvements enabled with the scale provided by Objaverse-XL' are demonstrated by training Zero123 on >100M renders is not supported by a controlled comparison. No ablation is described that holds the rendering pipeline (camera sampling, lighting, resolution), optimizer, and evaluation protocol fixed while varying only the source dataset or its size (e.g., original Objaverse vs. Objaverse-XL subsets). Consequently the zero-shot gains cannot be isolated from unablated training or rendering choices.

    Authors: We appreciate the referee highlighting the value of a controlled ablation. The manuscript's experiments focus on demonstrating what becomes possible at the scale of Objaverse-XL by training Zero123 on over 100 million rendered views, achieving strong zero-shot generalization. While we did not include an explicit ablation that retrains with identical rendering, optimization, and evaluation settings on the original Objaverse versus Objaverse-XL subsets, the primary contribution is the public release of this much larger and more diverse dataset. In the revised manuscript we will add a dedicated paragraph in the experiments section that (1) explicitly compares the data scale and object diversity to the original Objaverse used in prior Zero123 work and (2) clarifies that all rendering parameters (camera sampling, lighting, resolution) are fully documented so that future controlled studies can be performed. We believe this addresses the isolation concern without overstating the current evidence. revision: yes

  2. Referee: [Experiments] Experiments section: quantitative results for the Zero123 zero-shot novel-view-synthesis task lack reported baselines with exact matching settings, ablation controls, and statistical significance measures (e.g., standard errors or multiple runs). This weakens the strength of the scaling demonstration.

    Authors: We agree that clearer reporting of baselines and any available measures of variability would improve the manuscript. The current results are presented as the performance obtained when training on the full Objaverse-XL scale. In the revision we will expand the experimental details to include side-by-side numerical comparison with the original Zero123 numbers, explicitly noting any differences in training settings or data volume. Because retraining the full 100-million-image model multiple times is computationally prohibitive, we will add an explicit limitations paragraph stating that results are from single runs and that statistical significance testing was not performed; we will also report any variance observed in smaller-scale pilot experiments if they exist. These changes will make the strength and limitations of the scaling demonstration more transparent. revision: yes
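
A small sketch of the variability reporting proposed in this response: mean and standard error of a metric over independent seeds, as might be done for pilot runs. The values below are placeholders, not results from the paper.

    # Mean ± standard error over independent training seeds (sketch; the
    # PSNR values below are placeholders, not numbers from the paper).
    import numpy as np

    def mean_and_stderr(values):
        a = np.asarray(values, dtype=float)
        return a.mean(), a.std(ddof=1) / np.sqrt(len(a))

    psnr_runs = [18.2, 18.5, 18.1]  # hypothetical pilot-run metrics
    m, se = mean_and_stderr(psnr_runs)
    print(f"PSNR = {m:.2f} ± {se:.2f} (n={len(psnr_runs)})")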

Circularity Check

0 steps flagged

No circularity: empirical dataset release and scaling observation

Full rationale

The paper releases Objaverse-XL (10M+ 3D objects from diverse sources) and reports that training Zero123 on >100M multi-view renders from it yields strong zero-shot novel view synthesis. No derivation chain, equations, or 'predictions' are claimed. The central claim is an empirical scaling result, not a reduction of any output to fitted inputs or self-citations by construction. No self-definitional steps, uniqueness theorems, or ansatzes appear. The contribution is self-contained as a data release plus observed performance gains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the assumption that aggregating and deduplicating existing 3D assets produces a high-quality training distribution; no new physical axioms or invented entities are introduced.

axioms (1)
  • domain assumption: Deduplication across heterogeneous 3D sources removes near-duplicates without discarding useful diversity.
    Invoked in the dataset construction section to justify the final count and quality.

pith-pipeline@v0.9.0 · 5565 in / 1179 out tokens · 61852 ms · 2026-05-17T12:56:59.647505+00:00 · methodology


Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mix3R: Mixing Feed-forward Reconstruction and Generative 3D Priors for Joint Multi-view Aligned 3D Reconstruction and Pose Estimation

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    Mix3R mixes feed-forward reconstruction and generative 3D priors via Mixture-of-Transformers and overlap-based attention bias to achieve better-aligned 3D shapes and more accurate poses than either approach alone.

  2. Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.

  3. MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

    cs.CV · 2025-12 · unverdicted · novelty 7.0

    MoCapAnything reconstructs asset-specific BVH animations from monocular video by predicting 3D joint trajectories then applying constraint-aware inverse kinematics guided by a reference prompt encoder.

  4. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

    cs.CV · 2023-09 · unverdicted · novelty 7.0

    DreamGaussian creates high-quality textured 3D meshes from single-view images in 2 minutes via generative Gaussian Splatting with mesh extraction and UV refinement.

  5. Velox: Learning Representations of 4D Geometry and Appearance

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...

  6. 3D-ReGen: A Unified 3D Geometry Regeneration Framework

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.

  7. Reconstruction by Generation: 3D Multi-Object Scene Reconstruction from Sparse Observations

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    RecGen achieves state-of-the-art 3D multi-object scene reconstruction from sparse RGB-D views by combining compositional synthetic scene generation with strong 3D shape priors, outperforming SAM3D by 30%+ in shape quality.

  8. PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.

  9. CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

    cs.CV · 2024-06 · unverdicted · novelty 6.0

    CamCo equips image-to-video generators with Plücker-coordinate camera inputs and epipolar attention to improve 3D consistency and camera controllability.

  10. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    cs.RO · 2024-03 · accept · novelty 6.0

    DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.

  11. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    cs.CV · 2023-11 · conditional · novelty 6.0

    Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results.

  12. SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

    cs.CV · 2023-09 · unverdicted · novelty 6.0

    SyncDreamer produces multiview-consistent images from a single input image by jointly modeling their distribution and synchronizing intermediate diffusion states via 3D-aware attention.

  13. MVDream: Multi-view Diffusion for 3D Generation

    cs.CV · 2023-08 · conditional · novelty 6.0

    MVDream is a multi-view diffusion model that functions as a generalizable 3D prior, enabling more consistent text-to-3D generation and few-shot 3D concept learning from 2D examples.

  14. Syn4D: A Multiview Synthetic 4D Dataset

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.

  15. Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model

    cs.CV · 2023-10 · unverdicted · novelty 5.0

    Zero123++ produces high-quality 3D-consistent multi-view images from a single input by fine-tuning Stable Diffusion with targeted conditioning and training methods.

  16. LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

    cs.CV · 2026-04 · unverdicted · novelty 3.0

    This review organizes literature on large multimodal models and object-centric vision into four themes (understanding, referring segmentation, editing, and generation) while summarizing paradigms, strategies, and challenges.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · cited by 16 Pith papers · 14 internal anchors

  1. [1] URL https://commoncrawl.org/the-data/.
  2. [2] H. Aanæs, R. R. Jensen, G. Vogiatzis, E. Tola, and A. B. Dahl. Large-scale data for multiple-view stereopsis. International Journal of Computer Vision, pages 1–16, 2016.
  3. [3] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  4. [4] L. Biewald. Experiment tracking with Weights and Biases, 2020. URL https://www.wandb.com/. Software available from wandb.com.
  5. [5] Blender Online Community. Blender - a 3D modelling and rendering package. https://www.blender.org, 2023.
  6. [6] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-driven documents. IEEE Transactions on Visualization and Computer Graphics, 2011.
  7. [7] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  8. [8] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I, pages 213–229. Springer, 2020.
  9. [9] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
  10. [10] W. Chen, H. Ling, J. Gao, E. Smith, J. Lehtinen, A. Jacobson, and S. Fidler. Learning to predict 3D objects with an interpolation-based differentiable renderer. Advances in Neural Information Processing Systems, 32, 2019.
  11. [11] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar. Masked-attention mask transformer for universal image segmentation. 2022.
  12. [12] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, pages 628–644. Springer, 2016.
  13. [13] J. Collins, S. Goel, K. Deng, A. Luthra, L. Xu, E. Gundogdu, X. Zhang, T. F. Y. Vicente, T. Dideriksen, H. Arora, et al. ABO: Dataset and benchmarks for real-world 3D object understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21126–21136, 2022.
  14. [14] M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. Objaverse: A universe of annotated 3D objects. arXiv preprint arXiv:2212.08051, 2022.
  15. [15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  16. [16] K. Deng, A. Liu, J.-Y. Zhu, and D. Ramanan. Depth-supervised NeRF: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12882–12891, 2022.
  17. [17] L. Downs, A. Francis, N. Koenig, B. Kinman, R. Hickman, K. Reymann, T. B. McHugh, and V. Vanhoucke. Google Scanned Objects: A high-quality dataset of 3D scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022.
  18. [18] W. Falcon and The PyTorch Lightning team. PyTorch Lightning, Mar. 2019. URL https://github.com/Lightning-AI/lightning.
  19. [19] H. Fu, R. Jia, L. Gao, M. Gong, B. Zhao, S. Maybank, and D. Tao. 3D-FUTURE: 3D furniture shape with texture. International Journal of Computer Vision, 129:3313–3337, 2021.
  20. [20] S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, et al. DataComp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023.
  21. [21] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.
  22. [22] G. Gkioxari, J. Malik, and J. Johnson. Mesh R-CNN. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9785–9795, 2019.
  23. [23] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant. Array programming with NumPy. Nature, 585:357–362, 2020.
  24. [24] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
  25. [25] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  26. [26] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi: 10.1109/MCSE.2007.55.
  27. [27] A. Jain, M. Tancik, and P. Abbeel. Putting NeRF on a diet: Semantically consistent few-shot view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5885–5894, 2021.
  28. [28] H. Jun and A. Nichol. Shap-E: Generating conditional 3D implicit functions. arXiv preprint arXiv:2305.02463, 2023.
  29. [29] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  30. [30] H. Kato, Y. Ushiku, and T. Harada. Neural 3D mesh renderer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3907–3916, 2018.
  31. [31] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment Anything. arXiv preprint arXiv:2304.02643, 2023.
  32. [32] J. J. Lim, H. Pirsiavash, and A. Torralba. Parsing IKEA objects: Fine pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2992–2999, 2013.
  33. [33] C.-H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S. Fidler, M.-Y. Liu, and T.-Y. Lin. Magic3D: High-resolution text-to-3D content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023.
  34. [34] R. Liu and C. Vondrick. Humans as light bulbs: 3D human reconstruction from thermal reflection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12531–12542, 2023.
  35. [35] R. Liu, S. Menon, C. Mao, D. Park, S. Stent, and C. Vondrick. Shadows shed light on 3D objects. arXiv preprint arXiv:2206.08990, 2022.
  36. [36] R. Liu, R. Wu, B. V. Hoorick, P. Tokmakov, S. Zakharov, and C. Vondrick. Zero-1-to-3: Zero-shot one image to 3D object, 2023.
  37. [37] J. Lu, C. Clark, R. Zellers, R. Mottaghi, and A. Kembhavi. Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916, 2022.
  38. [38] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger. Occupancy networks: Learning 3D reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4460–4470, 2019.
  39. [39] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  40. [40] G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
  41. [41] D. Morrison, P. Corke, and J. Leitner. EGAD! An evolved grasping analysis dataset for diversity and reproducibility in robotic manipulation. IEEE Robotics and Automation Letters, 5(3):4368–4375, 2020.
  42. [42] A. Nichol, H. Jun, P. Dhariwal, P. Mishkin, and M. Chen. Point-E: A system for generating 3D point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
  43. [43] OpenAI. GPT-4 technical report. arXiv, 2023.
  44. [44] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  45. [45] The pandas development team. pandas-dev/pandas: Pandas, Feb. 2020. URL https://doi.org/10.5281/zenodo.3509134.
  46. [46] K. Park, K. Rematas, A. Farhadi, and S. M. Seitz. PhotoShape: Photorealistic materials for large-scale shape collections. arXiv preprint arXiv:1809.09761, 2018.
  47. [47] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
  48. [48] B. Poole, A. Jain, J. T. Barron, and B. Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
  49. [49] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  50. [50] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  51. [51] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
  52. [52] N. Ravi, J. Reizenstein, D. Novotny, T. Gordon, W.-Y. Lo, J. Johnson, and G. Gkioxari. Accelerating 3D deep learning with PyTorch3D. arXiv preprint arXiv:2007.08501, 2020.
  53. [53] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 2015.
  54. [54] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  55. [55] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
  56. [56] J. Tang. Stable-DreamFusion: Text-to-3D with Stable Diffusion, 2022. https://github.com/ashawkey/stable-dreamfusion.
  57. [57] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  58. [58] H. Wang, X. Du, J. Li, R. A. Yeh, and G. Shakhnarovich. Score Jacobian chaining: Lifting pretrained 2D diffusion models for 3D generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12619–12629, 2023.
  59. [59] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 52–67, 2018.
  60. [60] Q. Wang, Z. Wang, K. Genova, P. P. Srinivasan, H. Zhou, J. T. Barron, R. Martin-Brualla, N. Snavely, and T. Funkhouser. IBRNet: Learning multi-view image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2021.
  61. [61] M. L. Waskom. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021, 2021. doi: 10.21105/joss.03021. URL https://doi.org/10.21105/joss.03021.
  62. [62] C.-Y. Wu, J. Johnson, J. Malik, C. Feichtenhofer, and G. Gkioxari. Multiview compressive coding for 3D reconstruction. arXiv preprint arXiv:2301.08247, 2023.
  63. [63] T. Wu, J. Zhang, X. Fu, Y. Wang, J. Ren, L. Pan, W. Wu, L. Yang, J. Wang, C. Qian, D. Lin, and Z. Liu. OmniObject3D: Large-vocabulary 3D object dataset for realistic perception, reconstruction and generation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  64. [64] A. Yu, V. Ye, M. Tancik, and A. Kanazawa. pixelNeRF: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021.
  65. [65] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
  66. [66] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, et al. LIMA: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023.
  67. [67] Q. Zhou and A. Jacobson. Thingi10K: A dataset of 10,000 3D-printing models. arXiv preprint arXiv:1605.04797, 2016.