pith. sign in

arxiv: 2504.10466 · v2 · submitted 2025-04-14 · 💻 cs.CV

Art3D: Training-Free 3D Generation from Flat-Colored Illustration

Pith reviewed 2026-05-22 19:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords training-free 3D generationflat-colored illustrationimage-to-3D3D illusion enhancementVLM evaluationFlat-2D datasetart content creation
0
0 comments X

The pith

Art3D adds 3D illusion cues to flat-colored illustrations using features from pre-trained 2D models and VLM checks so existing generators can produce 3D assets from drawings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Art3D, a training-free technique that improves how image-to-3D models handle flat-colored inputs such as hand drawings. It extracts structural and semantic features from existing 2D image generation models and applies a vision-language model to evaluate and refine realism. This step creates stronger depth perception in the reference image before it enters any 3D generator. The authors also release the Flat-2D dataset of over 100 samples to test generalization on non-photorealistic art. If the approach works as described, creators gain a simple way to move from 2D sketches to 3D without style-specific retraining.

Core claim

Art3D lifts flat-colored 2D designs into 3D by enhancing the three-dimensional illusion in reference images. It does so through structural and semantic features drawn from pre-trained 2D image generation models together with a VLM-based realism evaluation. The method simplifies 3D generation from 2D inputs and adapts across a wide range of painting styles. The authors introduce the Flat-2D dataset to measure how current image-to-3D models perform on inputs that lack depth cues.

What carries the argument

Extraction of structural and semantic features from pre-trained 2D image generation models combined with VLM-based realism evaluation to insert plausible 3D cues into flat-colored references.

If this is right

  • Existing image-to-3D models produce higher-quality assets from artistic flat inputs after the enhancement step.
  • The same pipeline works on many painting styles without any additional training or fine-tuning.
  • Artists and designers can convert 2D illustrations to 3D more directly in their existing workflows.
  • The Flat-2D dataset provides a standard test set for measuring generalization on non-photorealistic references.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cue-insertion idea could serve as a preprocessing step for other 2D-to-3D tasks such as depth estimation or novel-view synthesis from sketches.
  • Design pipelines might adopt the enhancement as a default first stage, letting users stay in 2D tools longer before committing to 3D.
  • Extending the feature extraction to abstract or highly stylized art could test whether the method scales beyond the styles tested in the paper.

Load-bearing premise

Structural and semantic features from pre-trained 2D models plus VLM realism checks can add believable 3D cues to flat inputs without creating artifacts that harm later 3D generation.

What would settle it

Feeding Art3D-processed flat-colored images into multiple image-to-3D models and observing no gain or a drop in final 3D asset quality relative to the original flat images would disprove the central claim.

Figures

Figures reproduced from arXiv: 2504.10466 by Jiayi Shen, Rao Fu, Srinath Sridhar, Tao Lu, Xiaoyan Cong, Zekun Li.

Figure 1
Figure 1. Figure 1: Our Art3D creates high-quality 3D assets from a single flat-colored illustration and can be adapted to various drawing styles. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: We compare the generation results from In [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Pipeline. Art3D adds 3D illusion to the flat-colored image through ControlNet [42] based on the structure features, e.g., canny edge or depth map. The mesh is then synthesized based on the proxy image and textured by baking information from the input image. ask ’Which image do you think is the most realistic and shows the most 3D feeling?’. ˆI = VQA({Ri} N i=1). (2) Numerous proxy conditions are available,… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparisons. We combine our augmentation module with InstantMesh [40] and perform comparisons with state￾of-the-art pre-trained image-to-3D methods Shap-E [14], LN3Diff [16], InstantMesh [40], 3DTopia-XL [5], LGM [31] and Trellis [38]. All the flat-shaped input images are from our curated dataset Flat-2D. Our augmentation module can significantly improve the geometry quality of generated 3D ass… view at source ↗
Figure 1
Figure 1. Figure 1: Image Gallery of Flat-2D. 1 arXiv:2504.10466v1 [cs.CV] 14 Apr 2025 [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Image Gallery of Flat-2D [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Image Gallery of Flat-2D. 2 [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
read the original abstract

Large-scale pre-trained image-to-3D generative models have exhibited remarkable capabilities in diverse shape generations. However, most of them struggle to synthesize plausible 3D assets when the reference image is flat-colored like hand drawings due to the lack of 3D illusion, which are often the most user-friendly input modalities in art content creation. To this end, we propose Art3D, a training-free method that can lift flat-colored 2D designs into 3D. By leveraging structural and semantic features with pre-trained 2D image generation models and a VLM-based realism evaluation, Art3D successfully enhances the three-dimensional illusion in reference images, thus simplifying the process of generating 3D from 2D, and proves adaptable to a wide range of painting styles. To benchmark the generalization performance of existing image-to-3D models on flat-colored images without 3D feeling, we collect a new dataset, Flat-2D, with over 100 samples. Experimental results demonstrate the performance and robustness of Art3D, exhibiting superior generalizable capacity and promising practical applicability. Our source code and dataset will be publicly available on our project page: https://joy-jy11.github.io/ .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Art3D, a training-free method to lift flat-colored 2D designs into 3D by enhancing the three-dimensional illusion using structural and semantic features from pre-trained 2D image generation models and a VLM-based realism evaluation. It collects a new Flat-2D dataset with over 100 samples to benchmark image-to-3D models on such inputs and reports experimental results showing superior generalizable capacity and adaptability to various painting styles.

Significance. If validated, this approach could significantly simplify 3D asset creation from user-friendly artistic inputs like hand drawings, with broad applicability in art content creation. The public release of code and the Flat-2D dataset would be valuable contributions for reproducibility and further benchmarking in the field.

major comments (2)
  1. [Abstract] The abstract asserts successful enhancement and robustness yet supplies no quantitative metrics, ablation studies, or concrete description of how the feature extraction and VLM loop are implemented, so it is impossible to judge whether the reported outcomes actually support the stated claims.
  2. [Method and Experiments] The reliance on VLM for realism evaluation lacks safeguards against subtle geometric inconsistencies (e.g., depth ordering errors or perspective-violating shading) that may not be detected by the VLM but would break downstream image-to-3D models; no explicit 3D consistency enforcement or mesh-level validation is described.
minor comments (1)
  1. [Abstract] Consider adding a brief mention of the specific pre-trained models used for feature extraction to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify our contributions. We address each major comment point-by-point below, indicating planned revisions where appropriate. Our responses focus on substance and aim to strengthen the manuscript without overstating current results.

read point-by-point responses
  1. Referee: [Abstract] The abstract asserts successful enhancement and robustness yet supplies no quantitative metrics, ablation studies, or concrete description of how the feature extraction and VLM loop are implemented, so it is impossible to judge whether the reported outcomes actually support the stated claims.

    Authors: We agree that the abstract, as a concise summary, does not include specific metrics or implementation details, which are instead provided in Sections 3 and 4. Section 3.1 describes the structural and semantic feature extraction from pre-trained 2D models, while Section 3.2 details the iterative VLM-based realism evaluation loop. Experimental results in Section 4 include quantitative comparisons on the Flat-2D dataset and qualitative ablations. To address the concern, we will revise the abstract to briefly reference the key performance gains (e.g., improved generalization across painting styles) and outline the core pipeline components, while maintaining brevity. revision: yes

  2. Referee: [Method and Experiments] The reliance on VLM for realism evaluation lacks safeguards against subtle geometric inconsistencies (e.g., depth ordering errors or perspective-violating shading) that may not be detected by the VLM but would break downstream image-to-3D models; no explicit 3D consistency enforcement or mesh-level validation is described.

    Authors: This is a fair observation. Our VLM evaluation targets perceptual 3D illusion and realism in the enhanced 2D image, which serves as input to existing image-to-3D models; however, it does not include explicit geometric checks for issues such as depth ordering or shading consistency, nor mesh-level validation, as the method is designed to be training-free and operate in image space. We will add a dedicated limitations paragraph in the revised manuscript discussing these potential failure modes and their impact on downstream 3D generation. We will also include additional qualitative examples illustrating cases where subtle inconsistencies may persist. revision: partial

Circularity Check

0 steps flagged

No circularity: training-free composition of external pre-trained models

full rationale

The paper describes a training-free pipeline that applies structural and semantic features from off-the-shelf pre-trained 2D image generation models together with an external VLM-based realism evaluator to enhance 3D illusion in flat-colored inputs. No equations, fitted parameters, or internal predictions are defined within the paper; the Flat-2D dataset serves only for external benchmarking. The central claims rest on the independent capabilities of cited pre-trained models rather than any self-definitional reduction, self-citation load-bearing step, or ansatz smuggled via prior work by the same authors. The derivation chain is therefore self-contained against external benchmarks and receives a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on the untested premise that existing 2D pre-trained models already encode the right cues for flat art and that a VLM can serve as a reliable proxy for 3D plausibility; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption Pre-trained 2D image generation models supply structural and semantic features that can be repurposed to add 3D illusion to flat-colored inputs.
    The enhancement step depends on this transfer of features without additional training.
  • domain assumption A vision-language model can accurately evaluate and guide the realism of the enhanced 2D image for subsequent 3D generation.
    The VLM-based realism evaluation is presented as the key control mechanism.

pith-pipeline@v0.9.0 · 5761 in / 1458 out tokens · 80962 ms · 2026-05-22T19:36:17.458905+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement

    cs.CV 2026-04 unverdicted novelty 6.0

    REVIVE 3D generates voluminous 3D assets from flat 2D images via an inflated prior construction followed by latent-space refinement, plus new metrics for volume and flatness validated by user study.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1]

    https : //openai.com/index/gpt- 4v- system- card/ ,

    OpenAI: Gpt 4v(ision) system card (2023). https : //openai.com/index/gpt- 4v- system- card/ ,

  2. [2]

    Polydiff: Generating 3d polygonal meshes with diffusion models

    Antonio Alliegro, Yawar Siddiqui, Tatiana Tommasi, and Matthias Nießner. Polydiff: Generating 3d polygonal meshes with diffusion models. arXiv preprint arXiv:2312.11417 ,

  3. [3]

    A computational approach to edge detection

    John Canny. A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelli- gence, (6):679–698, 1986. 3

  4. [4]

    Fan- tasia3d: Disentangling geometry and appearance for high- quality text-to-3d content creation

    Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fan- tasia3d: Disentangling geometry and appearance for high- quality text-to-3d content creation. In Proceedings of the IEEE/CVF international conference on computer vision , pages 22246–22256, 2023. 2

  5. [5]

    3dtopia-xl: Scaling high-quality 3d asset generation via primitive diffusion, 2024

    Zhaoxi Chen, Jiaxiang Tang, Yuhao Dong, Ziang Cao, Fangzhou Hong, Yushi Lan, Tengfei Wang, Haozhe Xie, Tong Wu, Shunsuke Saito, Liang Pan, Dahua Lin, and Ziwei Liu. 3dtopia-xl: Scaling high-quality 3d asset generation via primitive diffusion, 2024. 1, 3, 4

  6. [6]

    Sdfusion: Multimodal 3d shape completion, reconstruction, and generation

    Yen-Chi Cheng, Hsin-Ying Lee, Sergey Tulyakov, Alexan- der G Schwing, and Liang-Yan Gui. Sdfusion: Multimodal 3d shape completion, reconstruction, and generation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4456–4465, 2023. 2

  7. [7]

    Diffusion-sdf: Conditional generative modeling of signed distance func- tions

    Gene Chou, Yuval Bahat, and Felix Heide. Diffusion-sdf: Conditional generative modeling of signed distance func- tions. In Proceedings of the IEEE/CVF international con- ference on computer vision, pages 2262–2272, 2023. 2

  8. [8]

    Automatic controllable colorization via imagination

    Xiaoyan Cong, Yue Wu, Qifeng Chen, and Chenyang Lei. Automatic controllable colorization via imagination. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2609–2619, 2024. 1

  9. [9]

    4drecons: 4d neural implicit deformable objects reconstruction from a single rgb- d camera with geometrical and topological regularizations,

    Xiaoyan Cong, Haitao Yang, Liyan Chen, Kaifeng Zhang, Li Yi, Chandrajit Bajaj, and Qixing Huang. 4drecons: 4d neural implicit deformable objects reconstruction from a single rgb- d camera with geometrical and topological regularizations,

  10. [10]

    Objaverse: A universe of annotated 3d objects, 2022

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects, 2022. 2

  11. [11]

    Diffusion models beat gans on image synthesis, 2021

    Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis, 2021. 1

  12. [12]

    3dgen: Triplane latent diffusion for textured mesh generation

    Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Bar- las O˘guz. 3dgen: Triplane latent diffusion for textured mesh generation. arXiv preprint arXiv:2303.05371, 2023. 2

  13. [13]

    Denoising diffu- sion probabilistic models, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models, 2020. 1

  14. [14]

    Shap-e: Generating condi- tional 3d implicit functions, 2023

    Heewoo Jun and Alex Nichol. Shap-e: Generating condi- tional 3d implicit functions, 2023. 2, 3, 4

  15. [15]

    Black Forest Labs. Flux. https://github.com/ black-forest-labs/flux, 2024. 2

  16. [16]

    Ln3diff: Scalable latent neural fields diffusion for speedy 3d generation

    Yushi Lan, Fangzhou Hong, Shuai Yang, Shangchen Zhou, Xuyi Meng, Bo Dai, Xingang Pan, and Chen Change Loy. Ln3diff: Scalable latent neural fields diffusion for speedy 3d generation. In ECCV, 2024. 1, 3, 4

  17. [17]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In In- ternational conference on machine learning , pages 19730– 19742. PMLR, 2023. 2

  18. [18]

    Magic3d: High-resolution text-to-3d content creation

    Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 300–309, 2023. 2

  19. [19]

    One-2-3-45: Any single image to 3d mesh in 45 seconds without per- shape optimization, 2023

    Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3d mesh in 45 seconds without per- shape optimization, 2023. 2

  20. [20]

    Meshdif- fusion: Score-based generative 3d mesh modeling

    Zhen Liu, Yao Feng, Michael J Black, Derek Nowrouzezahrai, Liam Paull, and Weiyang Liu. Meshdif- fusion: Score-based generative 3d mesh modeling. arXiv preprint arXiv:2303.08133, 2023. 2

  21. [21]

    Wonder3d: Sin- gle image to 3d using cross-domain diffusion.arXiv preprint arXiv:2310.15008, 2023

    Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Sin- gle image to 3d using cross-domain diffusion.arXiv preprint arXiv:2310.15008, 2023. 2

  22. [22]

    Pc2: Projection-conditioned point cloud diffu- sion for single-image 3d reconstruction

    Luke Melas-Kyriazi, Christian Rupprecht, and Andrea Vedaldi. Pc2: Projection-conditioned point cloud diffu- sion for single-image 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12923–12932, 2023. 2

  23. [23]

    Diffrf: Rendering-guided 3d radiance field diffusion

    Norman M ¨uller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulo, Peter Kontschieder, and Matthias Nießner. Diffrf: Rendering-guided 3d radiance field diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 4328–4338, 2023. 2

  24. [24]

    Point-E: A System for Generating 3D Point Clouds from Complex Prompts

    Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generat- ing 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022. 1, 2

  25. [25]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Milden- hall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022. 2

  26. [26]

    Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors, 2023

    Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Sko- rokhodov, Peter Wonka, Sergey Tulyakov, and Bernard Ghanem. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors, 2023. 2 5

  27. [27]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer, 2020

    Ren ´e Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer, 2020. 3

  28. [28]

    Com- mon objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction

    Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Com- mon objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10901–10911, 2021. 2

  29. [29]

    Diffusion-based signed distance fields for 3d shape gener- ation

    Jaehyeok Shim, Changwoo Kang, and Kyungdon Joo. Diffusion-based signed distance fields for 3d shape gener- ation. In Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition , pages 20887–20897,

  30. [30]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equa- tions. arXiv preprint arXiv:2011.13456, 2020. 1

  31. [31]

    Lgm: Large multi-view gaussian model for high-resolution 3d content creation.arXiv preprint arXiv:2402.05054, 2024

    Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation.arXiv preprint arXiv:2402.05054, 2024. 1, 2, 3, 4

  32. [32]

    Gecco: Geometrically-conditioned point diffusion models

    Michał J Tyszkiewicz, Pascal Fua, and Eduard Trulls. Gecco: Geometrically-conditioned point diffusion models. In Pro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 2128–2138, 2023. 2

  33. [33]

    Yeh, and Greg Shakhnarovich

    Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12619–12629, 2023. 2

  34. [34]

    Rodin: A generative model for sculpting 3d digital avatars using diffusion

    Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, et al. Rodin: A generative model for sculpting 3d digital avatars using diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4563–4573, 2023. 2

  35. [35]

    Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distilla- tion, 2023

    Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distilla- tion, 2023. 2

  36. [36]

    Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation

    Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Liang Pan Jiawei Ren, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, Dahua Lin, and Ziwei Liu. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 2

  37. [37]

    Sketch and text guided diffusion model for colored point cloud generation

    Zijie Wu, Yaonan Wang, Mingtao Feng, He Xie, and Ajmal Mian. Sketch and text guided diffusion model for colored point cloud generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8929– 8939, 2023. 2

  38. [38]

    Structured 3D Latents for Scalable and Versatile 3D Generation

    Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d gen- eration. arXiv preprint arXiv:2412.01506, 2024. 1, 2, 3, 4

  39. [39]

    Holistically-nested edge de- tection

    Saining Xie and Zhuowen Tu. Holistically-nested edge de- tection. In Proceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015. 3

  40. [40]

    Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models, 2024

    Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, and Ying Shan. Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models, 2024. 1, 2, 3, 4

  41. [41]

    3dshape2vecset: A 3d shape representation for neu- ral fields and generative diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–16, 2023

    Biao Zhang, Jiapeng Tang, Matthias Niessner, and Peter Wonka. 3dshape2vecset: A 3d shape representation for neu- ral fields and generative diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–16, 2023. 2

  42. [42]

    Adding conditional control to text-to-image diffusion models, 2023

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023. 2, 3

  43. [43]

    Hunyuan3d 2.0: Scal- ing diffusion models for high resolution textured 3d assets generation, 2025

    Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Hao- han Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Shaoxiong Yang, ...

  44. [44]

    Locally attentional sdf diffusion for controllable 3d shape generation

    Xin-Yang Zheng, Hao Pan, Peng-Shuai Wang, Xin Tong, Yang Liu, and Heung-Yeung Shum. Locally attentional sdf diffusion for controllable 3d shape generation. ACM Trans- actions on Graphics (ToG), 42(4):1–13, 2023. 2

  45. [45]

    Oscilla- tion inversion: Understand the structure of large flow model through the lens of inversion method, 2024

    Yan Zheng, Zhenxiao Liang, Xiaoyan Cong, Lanqing guo, Yuehao Wang, Peihao Wang, and Zhangyang Wang. Oscilla- tion inversion: Understand the structure of large flow model through the lens of inversion method, 2024. 1

  46. [46]

    3d shape generation and completion through point-voxel diffusion

    Linqi Zhou, Yilun Du, and Jiajun Wu. 3d shape generation and completion through point-voxel diffusion. In Proceed- ings of the IEEE/CVF international conference on computer vision, pages 5826–5835, 2021. 2 6 Art3D: Training-Free 3D Generation from Flat-Colored Illustration Supplementary Material Supplemental materials provide more implementation details o...

  47. [47]

    Implementation Details In experiments, we use as reference conditions N1 = 2 canny edge maps and N2 = 2 depth maps. It is noticeable that Art3D can present superior and robust performance with only one canny edge map as the reference condition, making Art3D more efficient and convenient to be applied to downstream tasks

  48. [48]

    Art3D: Training-Free 3D Generation from Flat-Colored Illustration

    Dataset We show some examples of our newly collected dataset Flat-2D in Figure 2, Figure 2, and Figure 3. Figure 1. Image Gallery of Flat-2D. 1 arXiv:2504.10466v1 [cs.CV] 14 Apr 2025 Figure 2. Image Gallery of Flat-2D. Figure 3. Image Gallery of Flat-2D. 2