pith. machine review for the scientific record.

arXiv: 2311.04400 · v2 · submitted 2023-11-08 · 💻 cs.CV · cs.AI · cs.GR · cs.LG

Recognition: 2 theorem links · Lean Theorem

LRM: Large Reconstruction Model for Single Image to 3D

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 10:05 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI · cs.GR · cs.LG
keywords: 3D reconstruction · single-image 3D · neural radiance field · transformer model · large-scale training

The pith

A 500-million-parameter transformer reconstructs 3D objects from single images in five seconds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LRM is the first model to use a large transformer to predict a full 3D neural radiance field from one photo. Earlier methods trained on small category-specific datasets and often needed extra steps or fine-tuning. Here the team scales to 500 million parameters and trains on roughly one million objects drawn from synthetic renderings and real multi-view captures. The result is fast, general reconstruction that handles everyday photos and even AI-generated images without special preparation.

Core claim

The central discovery is that a highly scalable transformer with 500 million parameters can be trained end-to-end to predict a neural radiance field directly from a single input image. This is achieved using massive multi-view data containing around 1 million objects from both synthetic and real sources, leading to reconstructions that generalize across various inputs.

What carries the argument

The transformer-based architecture with 500 million learnable parameters that directly predicts a neural radiance field from the input image.
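
The sketch below shows one way such a feed-forward path can be wired up: an image encoder turns the photo into patch tokens, a transformer decoder cross-attends learned triplane tokens onto them, and a small MLP maps interpolated triplane features to color and density for volume rendering. This is a minimal illustration written against the appendix excerpts quoted later on this page; the encoder choice, layer counts, token counts, and dimensions are placeholders, not the paper's 500-million-parameter configuration.

    # Minimal sketch of a single-image-to-triplane-NeRF forward pass (illustrative sizes only).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LRMSketch(nn.Module):
        def __init__(self, dim=768, plane_res=32):
            super().__init__()
            self.plane_res = plane_res
            # Stand-in for a ViT-style image encoder: 16x16 patches -> tokens.
            self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)
            # Learned triplane tokens, decoded by cross-attending to the image tokens.
            self.triplane_tokens = nn.Parameter(torch.randn(3 * plane_res * plane_res, dim))
            layer = nn.TransformerDecoderLayer(d_model=dim, nhead=12, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=4)  # far shallower than the real model
            # Tiny NeRF MLP: triplane feature -> (RGB, density).
            self.nerf_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

        def sample_triplane(self, triplane, points):
            # triplane: (B, 3*R*R, dim); points: (B, P, 3) in [-1, 1]. Bilinear lookup of XY/XZ/YZ planes.
            B, _, dim = triplane.shape
            R = self.plane_res
            planes = triplane.reshape(B, 3, R, R, dim).permute(0, 1, 4, 2, 3)    # (B, 3, dim, R, R)
            feats = 0
            for i, axes in enumerate([(0, 1), (0, 2), (1, 2)]):
                grid = points[..., list(axes)].unsqueeze(1)                      # (B, 1, P, 2)
                sampled = F.grid_sample(planes[:, i], grid, align_corners=True)  # (B, dim, 1, P)
                feats = feats + sampled.squeeze(2).transpose(1, 2)               # (B, P, dim)
            return feats

        def forward(self, image, points):
            # image: (B, 3, H, W); points: (B, P, 3) query points for volume rendering.
            img_tokens = self.patchify(image).flatten(2).transpose(1, 2)         # (B, N_patches, dim)
            tokens = self.triplane_tokens.expand(image.shape[0], -1, -1)
            triplane = self.decoder(tokens, img_tokens)                          # image-conditioned triplane
            rgb_sigma = self.nerf_mlp(self.sample_triplane(triplane, points))
            return rgb_sigma[..., :3], rgb_sigma[..., 3]                         # colors, densities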

If this is right

  • High-quality 3D reconstructions become possible from real-world in-the-wild images.
  • Images created by generative models can be turned into 3D models without extra supervision.
  • Reconstruction completes in approximately five seconds on standard hardware.
  • The approach eliminates the need for category-specific fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Fast single-image 3D could enable real-time applications in augmented reality where users capture objects on the fly.
  • Combining this with text-to-image generators may allow text-to-3D pipelines with minimal manual work (a pipeline sketch follows this list).
  • Scaling similar models further might support reconstruction of more complex scenes rather than isolated objects.
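
A minimal sketch of the text-to-3D chaining imagined in the second point, assuming a text-to-image generator and a single-image reconstructor are available as opaque callables; the function names are illustrative and do not come from the paper.

    # Hypothetical glue for a text -> image -> 3D pipeline; both stages are passed in as callables.
    from typing import Any, Callable

    def text_to_3d(prompt: str,
                   generate_image: Callable[[str], Any],
                   reconstruct_3d: Callable[[Any], Any]) -> Any:
        # Chain a text-to-image generator with a feed-forward single-image reconstructor such as LRM.
        image = generate_image(prompt)   # e.g., one RGB image from a diffusion model
        return reconstruct_3d(image)     # a 3D asset (NeRF or mesh); ~5 s per the paper's reported speed

Because the reconstructor is feed-forward, the chain needs no per-prompt optimization; the generator's output quality and the reconstructor's robustness to generated imagery are the limiting factors.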

Load-bearing premise

End-to-end training of the large transformer on mixed synthetic and real multi-view data will yield high-quality, generalizable 3D models from arbitrary single images.

What would settle it

A test set of challenging in-the-wild images where the reconstructed NeRF either fails to match multi-view consistency or produces low-fidelity renderings compared to ground truth.
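
A sketch of how such a test could be scored, assuming rendered novel views and ground-truth captures are available as arrays. PSNR is written out explicitly; SSIM and LPIPS would come from standard packages. Illustrative only, not the paper's evaluation code.

    # Score held-out novel views rendered from the predicted NeRF against ground-truth captures.
    import numpy as np

    def psnr(rendered: np.ndarray, ground_truth: np.ndarray, max_val: float = 1.0) -> float:
        # Peak signal-to-noise ratio between two images in [0, max_val], shape (H, W, 3).
        mse = np.mean((rendered.astype(np.float64) - ground_truth.astype(np.float64)) ** 2)
        if mse == 0:
            return float("inf")
        return 10.0 * np.log10(max_val ** 2 / mse)

    def evaluate_views(rendered_views, ground_truth_views) -> float:
        # Average PSNR across held-out viewpoints; low values flag multi-view inconsistency or low fidelity.
        scores = [psnr(r, g) for r, g in zip(rendered_views, ground_truth_views)]
        return float(np.mean(scores))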

read the original abstract

We propose the first Large Reconstruction Model (LRM) that predicts the 3D model of an object from a single input image within just 5 seconds. In contrast to many previous methods that are trained on small-scale datasets such as ShapeNet in a category-specific fashion, LRM adopts a highly scalable transformer-based architecture with 500 million learnable parameters to directly predict a neural radiance field (NeRF) from the input image. We train our model in an end-to-end manner on massive multi-view data containing around 1 million objects, including both synthetic renderings from Objaverse and real captures from MVImgNet. This combination of a high-capacity model and large-scale training data empowers our model to be highly generalizable and produce high-quality 3D reconstructions from various testing inputs, including real-world in-the-wild captures and images created by generative models. Video demos and interactable 3D meshes can be found on our LRM project webpage: https://yiconghong.me/LRM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces the Large Reconstruction Model (LRM), a 500-million-parameter transformer that directly predicts a neural radiance field (NeRF) from a single input image in approximately 5 seconds. Trained end-to-end on roughly 1 million objects drawn from synthetic Objaverse renders and real MVImgNet captures, the model is claimed to deliver high-quality, category-agnostic 3D reconstructions that generalize to in-the-wild photographs and images produced by generative models.

Significance. If the performance claims are substantiated by rigorous quantitative evaluation, the work would mark a meaningful scaling step in single-image 3D reconstruction, moving beyond small-scale category-specific datasets such as ShapeNet toward large-capacity models trained on diverse multi-view data. The reported inference speed and end-to-end training approach constitute concrete strengths that could influence practical deployment in AR/VR and content-creation pipelines.

major comments (3)
  1. [Abstract, §4 Experiments] The central claims that LRM produces 'high-quality' and 'highly generalizable' NeRFs are unsupported by any quantitative metrics (PSNR, SSIM, LPIPS, or geometric error), ablation studies, or baseline comparisons. Without these numbers on held-out synthetic and real test splits, the empirical assertions cannot be evaluated.
  2. [§4 Experiments, §5 Results] The generalization argument to arbitrary in-the-wild images rests on the untested assumption that the Objaverse + MVImgNet distribution covers heavy background clutter, extreme lighting, partial occlusions, and non-canonical viewpoints. No quantitative results on such out-of-distribution real-world test cases are reported, leaving the weakest assumption unaddressed.
  3. [§3 Method] The precise mechanism by which the 500M-parameter transformer outputs the NeRF (tri-plane features, direct MLP weights, or another representation) and the exact training objective (volume rendering loss, regularization terms) are described at a high level only; additional equations or pseudocode are required for reproducibility.
minor comments (1)
  1. [Abstract] The project webpage URL is provided but should be accompanied by a permanent archive link (e.g., Internet Archive) to ensure long-term accessibility of the video demos and interactive meshes.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript to provide stronger empirical support and methodological clarity while preserving the core contributions.

read point-by-point responses
  1. Referee: [Abstract, §4 Experiments] The central claims that LRM produces 'high-quality' and 'highly generalizable' NeRFs are unsupported by any quantitative metrics (PSNR, SSIM, LPIPS, or geometric error), ablation studies, or baseline comparisons. Without these numbers on held-out synthetic and real test splits, the empirical assertions cannot be evaluated.

    Authors: We agree that quantitative metrics strengthen the evaluation. The manuscript currently emphasizes qualitative results and visual comparisons across diverse inputs to highlight the model's practical capabilities. In the revision we will add PSNR, SSIM, and LPIPS numbers on held-out synthetic and real test splits from Objaverse and MVImgNet, together with baseline comparisons where feasible. revision: yes

  2. Referee: [§4 Experiments, §5 Results] The generalization argument to arbitrary in-the-wild images rests on the untested assumption that the Objaverse + MVImgNet distribution covers heavy background clutter, extreme lighting, partial occlusions, and non-canonical viewpoints. No quantitative results on such out-of-distribution real-world test cases are reported, leaving the weakest assumption unaddressed.

    Authors: MVImgNet already contains real captures with varied backgrounds and viewpoints, and the qualitative results on in-the-wild and generative-model images provide supporting evidence. We will augment the revision with additional quantitative metrics on curated challenging real-world examples that exhibit clutter, lighting variation, and occlusions. revision: partial

  3. Referee: [§3 Method] The precise mechanism by which the 500M-parameter transformer outputs the NeRF (tri-plane features, direct MLP weights, or another representation) and the exact training objective (volume rendering loss, regularization terms) are described at a high level only; additional equations or pseudocode are required for reproducibility.

    Authors: We will expand Section 3 with explicit equations for the transformer-to-triplane mapping, the NeRF representation, the volume-rendering loss, and any regularization terms. Pseudocode for the forward pass and end-to-end training loop will also be included. revision: yes
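
Pending the promised pseudocode, the sketch below shows one plausible shape of that training step: composite the predicted color and density field along sampled rays with standard volume rendering, then minimize an image-reconstruction loss. The ray sampling, the MSE choice, and the placeholder regularizer are assumptions, not the paper's exact objective.

    # One plausible end-to-end training step (assumed objective, not the paper's formulation).
    import torch
    import torch.nn.functional as F

    def volume_render(rgb, sigma, deltas):
        # rgb: (R, S, 3) colors, sigma: (R, S) densities, deltas: (R, S) sample spacings along each ray.
        alpha = 1.0 - torch.exp(-sigma * deltas)                           # per-sample opacity
        trans = torch.cumprod(
            torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
        weights = alpha * trans                                            # compositing weights
        return (weights.unsqueeze(-1) * rgb).sum(dim=1)                    # (R, 3) rendered pixel colors

    def training_step(model, optimizer, image, ray_points, deltas, target_pixels, reg_weight=0.0):
        # model(image, points) -> (colors, densities); e.g., anything with the LRMSketch interface above.
        n_rays, n_samples = deltas.shape
        rgb, sigma = model(image, ray_points)                              # ray_points: (1, R*S, 3)
        rgb = rgb.reshape(n_rays, n_samples, 3)
        sigma = sigma.reshape(n_rays, n_samples)
        pred = volume_render(rgb, sigma, deltas)
        loss = F.mse_loss(pred, target_pixels)                             # image-reconstruction term
        loss = loss + reg_weight * sigma.mean()                            # placeholder regularizer (assumption)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return float(loss.detach())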

Circularity Check

0 steps flagged

No circularity: empirical performance claim from end-to-end training

full rationale

The paper's central claim is an empirical statement that a 500M-parameter transformer, trained end-to-end on ~1M objects from Objaverse and MVImgNet, produces high-quality NeRFs from single images. No derivation chain, equations, or first-principles result is presented that reduces the output to a quantity defined by the model's own fitted parameters or self-referential definitions. The architecture is a standard scalable transformer; training is described as direct prediction without any fitted-input-called-prediction or self-definitional steps. No load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the provided text. The result is independently testable via held-out evaluation and does not rely on renaming known results or importing uniqueness from prior author work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim rests on the assumption that large-scale end-to-end training on mixed synthetic and real multi-view data suffices for generalization; no new physical entities are postulated.

free parameters (1)
  • transformer parameter count
    500 million learnable parameters chosen as model capacity; their values are fitted during training on the million-object corpus.
axioms (1)
  • domain assumption: End-to-end supervised training on multi-view renderings and captures produces a feed-forward model that generalizes to single-view in-the-wild inputs
    Invoked in the description of training and claimed generalization.

pith-pipeline@v0.9.0 · 5511 in / 1260 out tokens · 41841 ms · 2026-05-15T10:05:37.989754+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.DimensionForcing dimension_forced — unclear

    Relation between the paper passage and the cited Recognition theorem.

    We train our model in an end-to-end manner on massive multi-view data containing around 1 million objects, including both synthetic renderings from Objaverse and real captures from MVImgNet.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

    cs.CR 2026-05 conditional novelty 8.0

    Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...

  2. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 7.0

    R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.

  3. MiXR: Harvesting and Recomposing Geometry from Real-World Objects for In-Situ 3D Design

    cs.HC 2026-05 unverdicted novelty 7.0

    MiXR enables in-situ 3D design by harvesting real-world geometry for user-defined compositions that generative AI then refines, outperforming text-only generative methods in control and fidelity per a 12-person study.

  4. MeshFIM: Local Low-Poly Mesh Editing via Fill-in-the-Middle Autoregressive Generation

    cs.GR 2026-05 unverdicted novelty 7.0

    MeshFIM enables local low-poly mesh editing by autoregressively filling target regions conditioned on context, using boundary markers, positional embeddings, and a gated geometry encoder to enforce attachment, topolog...

  5. Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures

    cs.CV 2026-05 unverdicted novelty 7.0

    HeadsUp maps multi-view captures to UV-parameterized 3D Gaussians on a template via an encoder-decoder, achieving state-of-the-art quality and generalization after training on more than 10,000 subjects.

  6. URoPE: Universal Relative Position Embedding across Geometric Spaces

    cs.CV 2026-04 unverdicted novelty 7.0

    URoPE is a parameter-free relative position embedding for transformers that works across arbitrary geometric spaces by ray sampling and projection, yielding consistent gains on novel view synthesis, 3D detection, trac...

  7. TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens

    cs.CV 2026-04 unverdicted novelty 7.0

    TokenGS uses learnable Gaussian tokens in an encoder-decoder architecture to regress 3D means directly, achieving SOTA feed-forward reconstruction on static and dynamic scenes with better robustness.

  8. Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

    cs.CV 2026-04 unverdicted novelty 7.0

    A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.

  9. AnchorSplat: Feed-Forward 3D Gaussian Splatting with 3D Geometric Priors

    cs.CV 2026-04 unverdicted novelty 7.0

    AnchorSplat uses anchor-aligned 3D Gaussians guided by geometric priors for feed-forward scene reconstruction, achieving SOTA novel view synthesis on ScanNet++ with fewer primitives and better view consistency.

  10. High-Fidelity Single-Image Head Modeling with Industry-Grade Topology

    cs.CV 2026-05 unverdicted novelty 6.0

    A single-image head reconstruction method uses coarse-to-fine optimization with normal consistency, landmarks, and geometry-aware constraints on curvature and conformality to produce meshes with industry-grade topolog...

  11. Structured 3D Latents Are Surprisingly Powerful: Unleashing Generalizable Style with 2D Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    DiLAST optimizes 3D latents via guidance from a 2D diffusion model to enable generalizable style transfer for OOD styles in 3D asset generation.

  12. Repurposing 3D Generative Model for Autoregressive Layout Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    LaviGen turns 3D generative models into an autoregressive layout generator that models geometric and physical constraints, delivering 19% higher physical plausibility and 65% faster inference on the LayoutVLM benchmark.

  13. Real-Time Human Reconstruction and Animation using Feed-Forward Gaussian Splatting

    cs.CV 2026-04 unverdicted novelty 6.0

    A feed-forward network predicts per-SMPL-X-vertex 3D Gaussians in canonical space from multi-view RGB images, enabling single-pass reconstruction and real-time animation via linear blend skinning.

  14. MemoryDiorama: Generating Dynamic 3D Diorama from Everyday Photos for Memory Recall

    cs.HC 2026-04 unverdicted novelty 6.0

    MemoryDiorama generates animated 3D dioramas from photos via LLM scene analysis and generative components, yielding richer autobiographical recall than photo-only or static diorama baselines.

  15. LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows

    cs.CV 2026-04 conditional novelty 6.0

    LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.

  16. SegviGen: Repurposing 3D Generative Model for Part Segmentation

    cs.CV 2026-03 unverdicted novelty 6.0

    SegviGen shows pretrained 3D generative models can be repurposed for part segmentation via voxel colorization, beating prior methods by 40% interactively and 15% on full segmentation using only 0.32% of labeled data.

  17. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 5.0

    R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.

  18. Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures

    cs.CV 2026-05 unverdicted novelty 5.0

    Pith review generated a malformed one-line summary.

  19. Pose-Aware Diffusion for 3D Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.

  20. Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images

    cs.CV 2026-04 unverdicted novelty 5.0

    Unposed-to-3D learns simulation-ready 3D vehicle models from unposed real images by predicting camera parameters for photometric self-supervision, then adding scale prediction and harmonization.

  21. UniMesh: Unifying 3D Mesh Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.

  22. AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation

    cs.CV 2026-04 unverdicted novelty 4.0

    AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantica...

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 20 Pith papers · 15 internal anchors

  1. [1]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450,

  2. [2]

    Keep it smpl: Automatic estimation of 3d human pose and shape from a single image

    Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, pp. 561–578. Springer,

  3. [3]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

  4. [4]

    ShapeNet: An Information-Rich 3D Model Repository

    Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012,

  5. [5]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311,

  6. [6]

    3d-r2n2: A unified approach for single and multi-view 3d object reconstruction

    Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14, pp. 628–644. Springer,

  7. [7]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. IEEE,

  8. [8]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,

  9. [9]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929,

  10. [10]

    Google scanned objects: A high-quality dataset of 3d scanned household items

    Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pp. 2553–2560. IEEE,

  11. [11]

    Learning a predictable and generative vector representation for objects

    Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI 14 , pp. 484–499. Springer,

  12. [12]

    Shape and viewpoint without keypoints

    Shubham Goel, Angjoo Kanazawa, and Jitendra Malik. Shape and viewpoint without keypoints. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16, pp. 88–104. Springer,

  13. [13]

    Shap-e: Generating conditional 3d implicit functions

    Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463,

  14. [14]

    Lerf: Language embedded radiance fields

    Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. arXiv preprint arXiv:2303.09553,

  15. [15]

    Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model

    Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model. arXiv preprint arXiv:2311.06214, 2023a. Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-trai...

  16. [16]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023b. Xueting Li, Sifei Liu, Kihwan Kim, Shalini De Mello, Varun Jampani, Ming-Hsuan Yang, and Jan Kautz. Self-supervised single-view 3d reconstruction via semanti...

  17. [17]

    One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization

    Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Zexiang Xu, Hao Su, et al. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. arXiv preprint arXiv:2306.16928, 2023a. Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. arXiv preprint ...

  18. [18]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983,

  19. [19]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,

  20. [20]

    Scalable 3d captioning with pre-trained models

    Tiange Luo, Chris Rockwell, Honglak Lee, and Justin Johnson. Scalable 3d captioning with pre-trained models. arXiv preprint arXiv:2306.07279,

  21. [21]

    3D-LMNet: Latent Embedding Matching for Accurate and Diverse 3D Point Cloud Reconstruction from a Single Image

    Priyanka Mandikal, KL Navaneet, Mayank Agarwal, and R Venkatesh Babu. 3d-lmnet: Latent embedding matching for accurate and diverse 3d point cloud reconstruction from a single image. arXiv preprint arXiv:1807.07796,

  22. [22]

    Rectified linear units improve restricted Boltzmann machines

    Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814,

  23. [23]

    Point-E: A System for Generating 3D Point Clouds from Complex Prompts

    Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751,

  24. [24]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748,

  25. [25]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988,

  26. [26]

    Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors

    Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv preprint arXiv:2306.17843,

  27. [27]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125,

  28. [28]

    Gina-3d: Learning to generate implicit neural assets in the wild

    Bokui Shen, Xinchen Yan, Charles R Qi, Mahyar Najibi, Boyang Deng, Leonidas Guibas, Yin Zhou, and Dragomir Anguelov. Gina-3d: Learning to generate implicit neural assets in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pp. 4913–4926, 2023a. Qiuhong Shen, Xingyi Yang, and Xinchao Wang. Anything-3d: Towards...

  29. [29]

    Lxmert: Learning cross-modality encoder representations from transformers

    Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490,

  30. [30]

    Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior

    Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. arXiv preprint arXiv:2303.14184,

  31. [31]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,

  32. [32]

    Internvideo: General video foundation models via generative and discriminative learning

    Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191,

  33. [33]

    CoCa: Contrastive Captioners are Image-Text Foundation Models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917,

  34. [34]

    Nerf++: Analyzing and improving neural radiance fields

    Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492,

  35. [35]

    NeRF, when coupled with differentiable volume rendering, can be optimized with just image reconstruction losses

    Appendices, A Background of Model Components, A.1 NeRF: We adopt NeRF (Mildenhall et al., 2021), specifically the compact triplane NeRF variant (Chan et al., 2022), as our 3D representation to predict in LRM. NeRF, when coupled with differentiable volume rendering, can be optimized with just image reconstruction losses. At the core of NeRF (Mildenhall et al.,

  36. [36]

    and its variants (Chan et al., 2022; Chen et al., 2022a; Müller et al., 2022; Sun et al.,

  37. [37]

    is a spatially-varying color (modeling appearance) and density (modeling geometry) field function. Given a 3D point p, the color and density field (u, σ) can be written as: (u, σ) = MLP_nerf(f_θ(p))   (7), where the spatial encoding f_θ is used to facilitate the MLP_nerf to learn high-frequency signals. Different NeRF variants (Chan et al., 2022; Chen et al...

  38. [38]

    typically differ from each other in terms of the choice of the spatial encoding and the size of the MLP. In this work, we use the triplane spatial encoding function proposed by EG3D (Chan et al., 2022), because of its low tokenization complexity (O(N²) as opposed to a voxel grid's O(N³) complexity, where N is spatial resolution). Images are rendered fr...

  39. [39]

    Then the output is the weighted summation of the conditions {y_i} with respect to the attention score α_i

    This attention score measures the relationship between input and conditions. Then the output is the weighted summation of the conditions {y_i} with respect to the attention score α_i: α_i = softmax_i{x^⊤ y_i}   (10), Attn(x; {y_i}_i) = Σ_i α_i y_i   (11). For some specific cases (e.g., in the transformer attention layer below), the attention operator wants to differentiate...

  40. [40]

    The multi-head attention is implemented by first splitting the input features into smaller queries

    is proposed. The multi-head attention is implemented by first splitting the input features into smaller queries: [x_1, ..., x_{n_h}] = x   (14), where n_h is the number of heads. Meanwhile, y_i and z_i are split into {y_i^k}_k and {z_i^k}_k in a similar way. After that, the output of each head is computed independently and the final output is a concatenation o...

  41. [41]

    The intermediate hidden dimension is 4 times the model dimension

    activation in between. The intermediate hidden dimension is 4 times the model dimension. Layer Normalization: We take the default LayerNorm (LN) implementation in PyTorch (Paszke et al., 2019). Besides the LN layers in ModLN as in Sec. 3.2, we follow the Pre-LN architecture to also apply LN to the final output of transformers, e.g., the output of ViT...

  42. [42]

    We apply a gradient clipping of 1.0 and a weight decay of 0.05

    to be 0.95. We apply a gradient clipping of 1.0 and a weight decay of 0.05. The weight decay is only applied to weights that are not biases and not in layer normalization layers. We use BF16 precision in the mixed-precision training. To save computational cost in training, we resize the reference novel views from 512×512 to a randomly chosen reso...

  43. [43]

    We can see that our LRM consistently outperforms previous approaches in all metrics

    and measured the novel view synthesis quality of 20 reference views (FID, CLIP-Similarity (Radford et al., 2021), PSNR, LPIPS (Zhang et al., 2018)) and the geometric quality (Chamfer Distance), as shown in the Table below. We can see that our LRM consistently outperforms previous approaches in all metrics. Table 1: Comparison between LRM and state-of-the-...

  44. [44]

    Removing it from training will decrease the CLIP-Similarity, SSIM, and LPIPS scores to 74.7, 76.4, and 29.4, respectively

    has a huge impact on the results. Removing it from training will decrease the CLIP-Similarity, SSIM, and LPIPS scores to 74.7, 76.4, and 29.4, respectively. Appendix E, Visualizations: We present more visualizations of the reconstructed 3D shapes in the following pages. The input images include photos captured by our phone camera, images from Objaverse (Deitke et...