pith. machine review for the scientific record.

arxiv: 2506.15442 · v1 · submitted 2025-06-18 · 💻 cs.CV · cs.AI

Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material

Pith Number: pith:SHXQOHK4 · state: computed
4 claims · 47 references · 2 theorem links. This is the computed registry record for this paper; it is not author-attested yet.

Pith reviewed 2026-05-17 23:05 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords 3D generation · AIGC · PBR materials · texture synthesis · shape generation · DiT model · 3D assets · tutorial

The pith

Hunyuan3D 2.1 generates high-fidelity 3D assets with production-ready PBR materials from images using two dedicated models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Hunyuan3D 2.1 as a tutorial case study for building 3D generative AI systems. It describes a pipeline that starts with data preparation and leads to shape generation and texture synthesis. The system separates the tasks into Hunyuan3D-DiT for creating the 3D shape and Hunyuan3D-Paint for applying detailed textures and PBR materials. Readers care because this makes advanced 3D model creation more approachable for fields like gaming, film, and design by providing concrete steps for training and evaluation. The goal is to equip users with the ability to fine-tune or create their own robust 3D models.

Core claim

Hunyuan3D 2.1 is an advanced system for producing high-resolution, textured 3D assets from images. It consists of Hunyuan3D-DiT for shape generation and Hunyuan3D-Paint for texture synthesis. The tutorial walks through data preparation, model architecture, training strategies, evaluation metrics, and deployment to allow replication of the process for applications in gaming, virtual reality, and industrial design.

What carries the argument

The two core components: Hunyuan3D-DiT which generates the base 3D shape from input data and Hunyuan3D-Paint which synthesizes textures and PBR materials onto the shape.
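
The two-stage split described above can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the `Mesh` and `TexturedAsset` types, the `generate_asset` function, the two dummy stages, and the PBR channel names (`albedo`, `metallic`, `roughness`, borrowed from the common metallic-roughness convention) are all hypothetical stand-ins for Hunyuan3D-DiT and Hunyuan3D-Paint, whose real interfaces are not given in this record.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Vertex = Tuple[float, float, float]
Face = Tuple[int, int, int]


@dataclass
class Mesh:
    """Untextured geometry, as a shape stage might return it (hypothetical type)."""
    vertices: List[Vertex]
    faces: List[Face]


@dataclass
class TexturedAsset:
    """Geometry plus PBR maps keyed by channel name (hypothetical type)."""
    mesh: Mesh
    pbr_maps: Dict[str, str] = field(default_factory=dict)


# Stage signatures are assumptions made for this sketch only.
ShapeStage = Callable[[bytes], Mesh]
TextureStage = Callable[[Mesh, bytes], Dict[str, str]]


def generate_asset(image: bytes, shape_stage: ShapeStage,
                   texture_stage: TextureStage) -> TexturedAsset:
    """Run shape generation first, then texture the resulting mesh."""
    mesh = shape_stage(image)
    maps = texture_stage(mesh, image)
    return TexturedAsset(mesh=mesh, pbr_maps=maps)


def dummy_shape_stage(image: bytes) -> Mesh:
    """Placeholder for a shape model: always returns a single triangle."""
    return Mesh(vertices=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
                faces=[(0, 1, 2)])


def dummy_texture_stage(mesh: Mesh, image: bytes) -> Dict[str, str]:
    """Placeholder for a texture model: returns file names for assumed channels."""
    return {"albedo": "albedo.png",
            "metallic": "metallic.png",
            "roughness": "roughness.png"}


asset = generate_asset(b"fake-image-bytes", dummy_shape_stage, dummy_texture_stage)
print(sorted(asset.pbr_maps))  # → ['albedo', 'metallic', 'roughness']
```

Passing the stages in as callables mirrors the separation the paper describes: either stage can be replaced or fine-tuned without touching the other, provided the mesh handed from the first to the second keeps the same form.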

If this is right

  • Developers can follow the workflow to create custom 3D generative models for specific industries.
  • The separation of shape and texture tasks enables higher quality outputs in each stage.
  • Evaluation metrics provide ways to measure the fidelity and material accuracy of generated assets.
  • Deployment considerations allow integration into production pipelines for VR and design software.
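
As one illustration of what a shape-fidelity metric can look like, the sketch below computes a symmetric Chamfer distance between two point clouds. This record does not say which metrics the paper uses, so the choice of Chamfer distance, the squared-distance variant, and the function name `chamfer_distance` are assumptions made for the example.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _nearest_sq_dist(p: Point, cloud: Sequence[Point]) -> float:
    """Squared Euclidean distance from p to its nearest neighbour in cloud."""
    return min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in cloud)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance using mean squared nearest-neighbour distances.

    This is a brute-force O(len(a) * len(b)) version meant for illustration,
    not for large point clouds.
    """
    if not a or not b:
        raise ValueError("both point clouds must be non-empty")
    a_to_b = sum(_nearest_sq_dist(p, b) for p in a) / len(a)
    b_to_a = sum(_nearest_sq_dist(q, a) for q in b) / len(b)
    return a_to_b + b_to_a


generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0)]
print(chamfer_distance(generated, reference))  # → 0.5
```

A lower value means the sampled surfaces lie closer together. A metric of this kind measures geometry only and says nothing about material accuracy, which would need a separate image-space or material-space comparison.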

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such systems could democratize 3D content creation beyond professional studios.
  • Combining image inputs with this pipeline might speed up prototyping in product design.
  • Extensions to handle video inputs could enable dynamic 3D scene generation in the future.

Load-bearing premise

The data preparation, architecture choices, and training strategies outlined will reliably lead to high-resolution 3D assets that include accurate production-ready PBR materials.

What would settle it

A test where the trained Hunyuan3D-DiT and Hunyuan3D-Paint models are applied to new images and the resulting 3D models fail to render with realistic PBR properties under standard lighting conditions.

Original abstract

3D AI-generated content (AIGC) is a passionate field that has significantly accelerated the creation of 3D models in gaming, film, and design. Despite the development of several groundbreaking models that have revolutionized 3D generation, the field remains largely accessible only to researchers, developers, and designers due to the complexities involved in collecting, processing, and training 3D models. To address these challenges, we introduce Hunyuan3D 2.1 as a case study in this tutorial. This tutorial offers a comprehensive, step-by-step guide on processing 3D data, training a 3D generative model, and evaluating its performance using Hunyuan3D 2.1, an advanced system for producing high-resolution, textured 3D assets. The system comprises two core components: the Hunyuan3D-DiT for shape generation and the Hunyuan3D-Paint for texture synthesis. We will explore the entire workflow, including data preparation, model architecture, training strategies, evaluation metrics, and deployment. By the conclusion of this tutorial, you will have the knowledge to finetune or develop a robust 3D generative model suitable for applications in gaming, virtual reality, and industrial design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Hunyuan3D 2.1 as a tutorial case study for 3D AIGC, describing a workflow to generate high-resolution, textured 3D assets with production-ready PBR materials from images. The system has two core components: Hunyuan3D-DiT for shape generation and Hunyuan3D-Paint for texture synthesis. It covers data preparation, model architecture, training strategies, evaluation metrics, and deployment, with the goal of enabling readers to finetune or develop similar 3D generative models for gaming, VR, and design applications.

Significance. If the described data preparation, architectures, and training strategies demonstrably yield high-fidelity PBR assets, the tutorial could offer practical value as a step-by-step guide for practitioners. However, the lack of any reported quantitative benchmarks, ablations, or baseline comparisons in the manuscript substantially reduces its potential impact as a scientific contribution in computer vision.

major comments (2)
  1. [Abstract and Evaluation Metrics section] The central claim that the workflow produces 'high-resolution, production-ready 3D assets with PBR materials' is load-bearing yet unsupported: the abstract and tutorial description outline evaluation metrics but report no numerical results, tables of metrics, or comparisons to prior methods, leaving the performance assertions unverifiable.
  2. [Model Architecture and Training Strategies] § on Hunyuan3D-DiT and Hunyuan3D-Paint: the tutorial framing assumes that the outlined architectures and training strategies will reliably achieve the stated quality, but without any ablation studies or failure-case analysis this assumption remains untested and central to the tutorial's utility.
minor comments (1)
  1. [Introduction] The manuscript would benefit from clearer distinction between tutorial instructions and any novel technical contributions, as the current presentation blurs the line between educational content and research claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential practical value of the tutorial. We clarify that the manuscript is framed as a case study and step-by-step guide rather than a benchmark-focused research paper. We address the concerns below and will incorporate revisions to strengthen the presentation of results and design rationale.

Point-by-point responses
  1. Referee: [Abstract and Evaluation Metrics section] The central claim that the workflow produces 'high-resolution, production-ready 3D assets with PBR materials' is load-bearing yet unsupported: the abstract and tutorial description outline evaluation metrics but report no numerical results, tables of metrics, or comparisons to prior methods, leaving the performance assertions unverifiable.

    Authors: We acknowledge that specific numerical results and baseline comparisons are not reported in the current version. The manuscript is intended as a tutorial describing the complete workflow, data preparation, architectures, and deployment process using Hunyuan3D 2.1 as the running example. To address this, we will revise the abstract to emphasize the tutorial nature and add a dedicated subsection with qualitative examples, rendered outputs, and references to the quantitative evaluations published in the associated technical reports and model releases. This will make the performance claims more verifiable while preserving the tutorial focus.

    Revision: yes

  2. Referee: [Model Architecture and Training Strategies] § on Hunyuan3D-DiT and Hunyuan3D-Paint: the tutorial framing assumes that the outlined architectures and training strategies will reliably achieve the stated quality, but without any ablation studies or failure-case analysis this assumption remains untested and central to the tutorial's utility.

    Authors: We agree that ablation studies and failure-case analysis would increase the tutorial's utility. The current draft prioritizes describing the final architectures and training strategies that produced production-ready results. We will add a concise discussion of key design choices, observed sensitivities during training, and common failure modes (such as geometry artifacts or texture inconsistencies) drawn from our development process. This addition will be included in the revised manuscript without expanding the scope beyond a practical guide.

    Revision: partial

Circularity Check

0 steps flagged

No derivation chain or fitted predictions present in tutorial framing

Full rationale

The manuscript is explicitly positioned as a tutorial and case study describing data preparation, Hunyuan3D-DiT architecture, Hunyuan3D-Paint texture synthesis, training strategies, evaluation metrics, and deployment. No equations, mathematical derivations, parameter fits, or predictive claims that could reduce to inputs by construction appear in the provided abstract or described content. Central claims concern workflow utility for producing high-resolution PBR assets but rest on descriptive process rather than any self-referential reduction, self-citation load-bearing step, or ansatz smuggling. This is the expected honest non-finding for a tutorial-style paper without a derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no explicit free parameters, axioms, or invented entities beyond naming the two model components; all technical details are deferred to the full tutorial content.

pith-pipeline@v0.9.0 · 5721 in / 1004 out tokens · 33200 ms · 2026-05-17T23:05:38.147652+00:00

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

    cs.CR 2026-05 conditional novelty 8.0

    Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...

  2. Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation

    cs.CV 2026-04 unverdicted novelty 8.0

    The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.

  3. Pixal3D: Pixel-Aligned 3D Generation from Images

    cs.CV 2026-05 unverdicted novelty 6.0

    Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.

  4. DVD: Discrete Voxel Diffusion for 3D Generation and Editing

    cs.CV 2026-05 unverdicted novelty 6.0

    DVD treats voxel occupancy as a discrete variable in a diffusion framework to generate, assess, and edit sparse 3D voxels without continuous thresholding.

  5. 3D-ReGen: A Unified 3D Geometry Regeneration Framework

    cs.CV 2026-04 unverdicted novelty 6.0

    3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.

  6. Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers

    cs.CV 2026-04 unverdicted novelty 6.0

    Sculpt4D generates temporally coherent 4D shapes by integrating a block sparse attention mechanism with time-decaying mask into a pretrained 3D diffusion transformer, achieving SOTA results with 56% less computation.

  7. FurnSet: Exploiting Repeats for 3D Scene Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    FurnSet improves single-view 3D scene reconstruction by using per-object CLS tokens and set-aware self-attention to group and jointly reconstruct repeated object instances, with added scene-object conditioning and lay...

  8. LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows

    cs.CV 2026-04 conditional novelty 6.0

    LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.

  9. UniRecGen: Unifying Multi-View 3D Reconstruction and Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    UniRecGen unifies reconstruction and generation via shared canonical space and disentangled cooperative learning to produce complete, consistent 3D models from sparse views.

  10. SAM 3D: 3Dfy Anything in Images

    cs.CV 2025-11 unverdicted novelty 6.0

    SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.

  11. Syn4D: A Multiview Synthetic 4D Dataset

    cs.CV 2026-05 unverdicted novelty 5.0

    Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.

  12. DataEvolver: Let Your Data Build and Improve Itself via Goal-Driven Loop Agents

    cs.AI 2026-05 unverdicted novelty 5.0

    DataEvolver introduces a reusable framework with generation-time self-correction and validation-time self-expansion loops that improves visual datasets, shown to outperform baselines on an object-rotation task.

  13. Pose-Aware Diffusion for 3D Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.

  14. Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    Hitem3D 2.0 combines multi-view image synthesis with native 3D texture projection to improve completeness, cross-view consistency, and geometry alignment over prior methods.

  15. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

  16. Hunyuan3D 2.5: Towards High-Fidelity 3D Assets Generation with Ultimate Details

    cs.CV 2025-06 unverdicted novelty 4.0

    Hunyuan3D 2.5's LATTICE model with 10B parameters generates detailed 3D shapes from images and uses multi-view PBR for textures, outperforming prior methods in fidelity and mesh quality.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 16 Pith papers · 13 internal anchors

  1. [1]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  2. [2]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  3. [3]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024

  4. [4]

    Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding, 2024

    Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue,...

  5. [5]

    Hunyuanvideo: A systematic framework for large video generative models, 2024

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...

  6. [6]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  7. [7]

    3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models

    Biao Zhang, Jiapeng Tang, Matthias Niessner, and Peter Wonka. 3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models. ACM Transactions On Graphics (TOG), 42(4):1–16, 2023

  8. [8]

    Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation

    Zibo Zhao, Wen Liu, Xin Chen, Xianfang Zeng, Rui Wang, Pei Cheng, Bin Fu, Tao Chen, Gang Yu, and Shenghua Gao. Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation. Advances in neural information processing systems, 36:73969–73982, 2023

  9. [9]

    TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models

    Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, et al. Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models. arXiv preprint arXiv:2502.06608, 2025

  10. [10]

    Scaling mesh generation via compressive tokenization

    Haohan Weng, Zibo Zhao, Biwen Lei, Xianghui Yang, Jian Liu, Zeqiang Lai, Zhuo Chen, Yuhong Liu, Jie Jiang, Chunchao Guo, et al. Scaling mesh generation via compressive tokenization. arXiv preprint arXiv:2411.07025, 2024

  11. [11]

    Clay: A controllable large-scale generative model for creating high-quality 3d assets

    Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, and Jingyi Yu. Clay: A controllable large-scale generative model for creating high-quality 3d assets. ACM Transactions on Graphics (TOG), 43(4):1–20, 2024

  12. [12]

    Flexitex: Enhancing texture generation with visual guidance

    DaDong Jiang, Xianghui Yang, Zibo Zhao, Sheng Zhang, Jiaao Yu, Zeqiang Lai, Shaoxiong Yang, Chunchao Guo, Xiaobo Zhou, and Zhihui Ke. Flexitex: Enhancing texture generation with visual guidance. arXiv preprint arXiv:2409.12431, 2024

  13. [13]

    Text-guided texturing by synchronized multi-view diffusion

    Yuxin Liu, Minshan Xie, Hanyuan Liu, and Tien-Tsin Wong. Text-guided texturing by synchronized multi-view diffusion. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

  14. [14]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  15. [15]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  16. [16]

    LRM: Large Reconstruction Model for Single Image to 3D

    Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400, 2023

  17. [17]

    Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generation

    Xianghui Yang, Huiwen Shi, Bowen Zhang, Fan Yang, Jiacheng Wang, Hongxu Zhao, Xinhai Liu, Xinzhou Wang, Qingxiang Lin, Jiaao Yu, et al. Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generation. arXiv preprint arXiv:2411.02293, 2024

  18. [18]

    InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

    Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, and Ying Shan. Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models. arXiv preprint arXiv:2404.07191, 2024

  19. [19]

    Viewfusion: Towards multi-view consistency via interpolated denoising

    Xianghui Yang, Yan Zuo, Sameera Ramasinghe, Loris Bazzani, Gil Avraham, and Anton van den Hengel. Viewfusion: Towards multi-view consistency via interpolated denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9870–9880, 2024

  20. [20]

    Consistent123: Improve consistency for one image to 3d object synthesis

    Haohan Weng, Tianyu Yang, Jianan Wang, Yu Li, Tong Zhang, CL Chen, and Lei Zhang. Consistent123: Improve consistency for one image to 3d object synthesis. arXiv preprint arXiv:2310.08092, 2023

  21. [21]

    SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

    Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023

  22. [22]

    Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model

    Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. Zero123++: a single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110, 2023

  23. [23]

    Zero-1-to-3: Zero-shot one image to 3d object

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023

  24. [24]

    Craftsman: High-fidelity mesh generation with 3d native generation and interactive geometry refiner

    Weiyu Li, Jiarui Liu, Rui Chen, Yixun Liang, Xuelin Chen, Ping Tan, and Xiaoxiao Long. Craftsman: High-fidelity mesh generation with 3d native generation and interactive geometry refiner. arXiv preprint arXiv:2405.14979, 2024

  25. [25]

    Structured 3D Latents for Scalable and Versatile 3D Generation

    Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. arXiv preprint arXiv:2412.01506, 2024

  26. [26]

    Step1x-3d: Towards high-fidelity and controllable generation of textured 3d assets

    Weiyu Li, Xuanyang Zhang, Zheng Sun, Di Qi, Hao Li, Wei Cheng, Weiwei Cai, Shihao Wu, Jiarui Liu, Zihao Wang, et al. Step1x-3d: Towards high-fidelity and controllable generation of textured 3d assets. arXiv preprint arXiv:2505.07747, 2025

  27. [27]

    Direct3d-s2: Gigascale 3d generation made easy with spatial sparse attention

    Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Yikang Yang, Yajie Bao, Jiachen Qian, Siyu Zhu, Philip Torr, Xun Cao, and Yao Yao. Direct3d-s2: Gigascale 3d generation made easy with spatial sparse attention. arXiv preprint arXiv:2505.17412, 2025

  28. [28]

    ShapeNet: An Information-Rich 3D Model Repository

    Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015

  29. [29]

    3d shapenets: A deep representation for volumetric shapes

    Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015

  30. [30]

    Thingi10K: A Dataset of 10,000 3D-Printing Models

    Qingnan Zhou and Alec Jacobson. Thingi10k: A dataset of 10,000 3d-printing models. arXiv preprint arXiv:1605.04797, 2016

  31. [31]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13142–13153, 2023

  32. [32]

    Objaverse-xl: A universe of 10m+ 3d objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Information Processing Systems, 36:35799–35813, 2023

  33. [33]

    Dora: Sampling and benchmarking for 3d shape variational auto-encoders

    Rui Chen, Jianfeng Zhang, Yixun Liang, Guan Luo, Weiyu Li, Jiarui Liu, Xiu Li, Xiaoxiao Long, Jiashi Feng, and Ping Tan. Dora: Sampling and benchmarking for 3d shape variational auto-encoders. arXiv preprint arXiv:2412.17808, 2024

  34. [34]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  35. [35]

    Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky T. Q. Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code, 2024

  36. [36]

    Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

    Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, et al. Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation. arXiv preprint arXiv:2501.12202, 2025

  37. [37]

    Physically-based shading at disney

    Brent Burley and Walt Disney Animation Studios. Physically-based shading at disney. In ACM SIGGRAPH, volume 2012, pages 1–7, 2012

  38. [38]

    Materialmvp: Illumination-invariant material generation via multi-view pbr diffusion, 2025

    Zebin He, Mingxin Yang, Shuhui Yang, Yixuan Tang, Tao Wang, Kaihao Zhang, Guanying Chen, Yuhong Liu, Jie Jiang, Chunchao Guo, and Wenhan Luo. Materialmvp: Illumination-invariant material generation via multi-view pbr diffusion, 2025

  39. [39]

    Romantex: Decoupling 3d-aware rotary positional embedded multi-attention network for texture synthesis, 2025

    Yifei Feng, Mingxin Yang, Shuhui Yang, Sheng Zhang, Jiaao Yu, Zibo Zhao, Yuhong Liu, Jie Jiang, and Chunchao Guo. Romantex: Decoupling 3d-aware rotary positional embedded multi-attention network for texture synthesis, 2025

  40. [40]

    Common diffusion noise schedules and sample steps are flawed

    Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5404–5411, 2024

  41. [41]

    Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding

    Le Xue, Mingfei Gao, Chen Xing, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1179–1189, 2023

  42. [42]

    Uni3d: Exploring unified 3d representation at scale

    Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu, Tiejun Huang, and Xinlong Wang. Uni3d: Exploring unified 3d representation at scale. In The Twelfth International Conference on Learning Representations

  43. [43]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

  44. [44]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  45. [45]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  46. [46]

    Texgen: a generative diffusion model for mesh textures

    Xin Yu, Ze Yuan, Yuan-Chen Guo, Ying-Tian Liu, Jianhui Liu, Yangguang Li, Yan-Pei Cao, Ding Liang, and Xiaojuan Qi. Texgen: a generative diffusion model for mesh textures. ACM Transactions on Graphics, 43(6):1–14, 2024

  47. [47]

    3dtopia-xl: Scaling high-quality 3d asset generation via primitive diffusion

    Zhaoxi Chen, Jiaxiang Tang, Yuhao Dong, Ziang Cao, Fangzhou Hong, Yushi Lan, Tengfei Wang, Haozhe Xie, Tong Wu, Shunsuke Saito, et al. 3dtopia-xl: Scaling high-quality 3d asset generation via primitive diffusion. arXiv preprint arXiv:2409.12957, 2024