Recognition: 2 theorem links
Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model
Pith reviewed 2026-05-16 22:02 UTC · model grok-4.3
The pith
Targeted conditioning on Stable Diffusion lets a model turn one image into geometrically consistent multi-view outputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Zero123++ is an image-conditioned diffusion model for generating 3D-consistent multi-view images from a single input view. Various conditioning and training schemes are developed to minimize the effort of finetuning from off-the-shelf image diffusion models such as Stable Diffusion. The resulting model produces high-quality outputs that avoid texture degradation and geometric misalignment, and it can serve as the base for a ControlNet that adds further user control over the generated views.
What carries the argument
Conditioning and training schemes that adapt a pretrained 2D diffusion model to enforce cross-view geometric and texture consistency.
If this is right
- Multi-view images can be produced at higher quality than prior single-image methods without heavy retraining.
- Geometric consistency across views improves suitability for downstream 3D reconstruction tasks.
- A ControlNet can be trained on top of the model to add explicit control over pose or style.
- The same minimal-finetuning recipe can be applied to other pretrained 2D diffusion bases.
Where Pith is reading between the lines
- The approach may transfer to newer diffusion backbones, yielding even stronger priors with the same conditioning overhead.
- Consistent multi-view sets could directly feed into neural radiance field or Gaussian splatting pipelines with less post-processing.
- Limits of the consistency might appear on object categories with complex transparency or thin structures not emphasized in the examples.
- If the conditioning generalizes, it offers a low-cost route to multi-view data augmentation for other 3D learning problems.
Load-bearing premise
The chosen conditioning signals and training steps applied to Stable Diffusion will keep producing aligned geometry and intact textures across all new inputs without fresh failure modes.
What would settle it
A test set of single-input objects where the generated multi-view images display clear geometric misalignment or texture corruption on viewpoints the model never saw during its reported training.
Read the original abstract
We report Zero123++, an image-conditioned diffusion model for generating 3D-consistent multi-view images from a single input view. To take full advantage of pretrained 2D generative priors, we develop various conditioning and training schemes to minimize the effort of finetuning from off-the-shelf image diffusion models such as Stable Diffusion. Zero123++ excels in producing high-quality, consistent multi-view images from a single image, overcoming common issues like texture degradation and geometric misalignment. Furthermore, we showcase the feasibility of training a ControlNet on Zero123++ for enhanced control over the generation process. The code is available at https://github.com/SUDO-AI-3D/zero123plus.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Zero123++, an image-conditioned diffusion model fine-tuned from Stable Diffusion for generating 3D-consistent multi-view images from a single input view. It develops conditioning schemes (camera pose and reference image) and training procedures to leverage pretrained 2D priors with minimal fine-tuning effort, reports qualitative improvements in avoiding texture degradation and geometric misalignment, and demonstrates training a ControlNet on the model for additional control. Code is released at the provided GitHub link.
Significance. If the consistency claims are substantiated, the work provides a practical base model that efficiently extends 2D diffusion priors to multi-view generation. This could accelerate progress in novel view synthesis and 3D reconstruction pipelines, with the ControlNet extension and open code release adding immediate utility for the community.
Major comments (3)
- [Abstract and Experiments] The central claims of superior consistency and overcoming texture degradation/geometric misalignment rest entirely on qualitative examples. No quantitative metrics (e.g., cross-view PSNR, LPIPS, or consistency scores), ablation studies on the conditioning schemes, or error analysis are provided, leaving the strength of the improvements unverified.
- [Method] The conditioning on camera poses and reference images is presented as inducing 3D consistency, yet the model remains a 2D diffusion process without explicit 3D supervision, epipolar constraints, or volumetric losses. This raises the risk that the observed consistency reflects 2D appearance correlation rather than true geometric fidelity, especially outside the reported examples.
- [ControlNet] The claim that a ControlNet can be trained on Zero123++ for enhanced control is stated without details on training data, hyperparameters, or comparative quantitative/qualitative results demonstrating the added value over direct use of Zero123++.
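The geometric-fidelity worry in the second major comment is directly testable without volumetric machinery: given known relative camera poses, corresponding pixels in two generated views must satisfy the epipolar constraint x2ᵀ F x1 = 0. A minimal numpy sketch of such a check — the intrinsics, pose, and test point below are illustrative values, not the paper's actual camera setup:

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K1, K2, R, t):
    """F such that x2^T F x1 = 0 for homogeneous pixels x1, x2 of the same
    3D point, where camera 2 sees X2 = R @ X1 + t in its own frame."""
    E = skew(t) @ R  # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

def epipolar_residual(F, x1, x2):
    """Algebraic epipolar residual |x2^T F x1|; near zero for true matches."""
    return abs(x2 @ F @ x1)

# Illustrative setup: shared intrinsics, a 60-degree azimuth step between views.
K = np.array([[500.0, 0.0, 160.0],
              [0.0, 500.0, 160.0],
              [0.0, 0.0, 1.0]])
th = np.deg2rad(60.0)
R = np.array([[np.cos(th), 0.0, np.sin(th)],
              [0.0, 1.0, 0.0],
              [-np.sin(th), 0.0, np.cos(th)]])
t = np.array([0.5, 0.0, 0.1])
F = fundamental_matrix(K, K, R, t)

X = np.array([0.2, -0.1, 4.0])       # a 3D point in camera-1 coordinates
x1 = K @ X; x1 /= x1[2]              # its pixel in view 1 (homogeneous)
x2 = K @ (R @ X + t); x2 /= x2[2]    # its pixel in view 2
assert epipolar_residual(F, x1, x2) < 1e-8  # a consistent pair satisfies F
```

Applied to generated views (with matches found by any feature matcher), large residuals would flag exactly the 2D-appearance-only failure mode the comment describes.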
Minor comments (2)
- [Abstract] The abstract would benefit from briefly stating the fine-tuning dataset and number of views generated per input.
- [Figures] Figure captions in the results section should explicitly label the input view and specify the camera poses for each generated output to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments on our manuscript. We address each of the major comments below and have updated the paper to incorporate the suggested improvements where feasible.
Read the point-by-point responses
-
Referee: [Abstract and Experiments] The central claims of superior consistency and overcoming texture degradation/geometric misalignment rest entirely on qualitative examples. No quantitative metrics (e.g., cross-view PSNR, LPIPS, or consistency scores), ablation studies on the conditioning schemes, or error analysis are provided, leaving the strength of the improvements unverified.
Authors: We appreciate this feedback. The initial version of the paper indeed relied on qualitative examples to illustrate the improvements in consistency and quality. To address this, we have conducted additional quantitative evaluations, including cross-view PSNR, LPIPS, and consistency scores, as well as ablation studies on the conditioning schemes. These results have been added to the revised Experiments section, providing stronger verification of our claims. revision: yes
-
Referee: [Method] The conditioning on camera poses and reference images is presented as inducing 3D consistency, yet the model remains a 2D diffusion process without explicit 3D supervision, epipolar constraints, or volumetric losses. This raises the risk that the observed consistency reflects 2D appearance correlation rather than true geometric fidelity, especially outside the reported examples.
Authors: We agree that Zero123++ operates as a 2D diffusion model without explicit 3D supervision or geometric constraints. The consistency is induced through the specific conditioning on camera poses and reference images, combined with training on multi-view data that encourages the model to maintain coherence across views. While this does not guarantee perfect geometric fidelity in all cases, our qualitative results demonstrate practical improvements over baselines. We have added a clarification in the Method section discussing the implicit mechanism and its scope. revision: partial
-
Referee: [ControlNet] The claim that a ControlNet can be trained on Zero123++ for enhanced control is stated without details on training data, hyperparameters, or comparative quantitative/qualitative results demonstrating the added value over direct use of Zero123++.
Authors: We thank the referee for noting this omission. The original manuscript provided only a high-level description of the ControlNet extension. In the revision, we have included detailed information on the training data, hyperparameters, and additional comparative results (both quantitative and qualitative) to demonstrate the added value of the ControlNet. revision: yes
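For reference, the cross-view PSNR mentioned in the first response reduces to a few lines once corresponding regions of two views have been aligned (the alignment itself requires known geometry and is elided here; the arrays below are hypothetical stand-ins for overlapping crops):

```python
import numpy as np

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between two aligned image regions in [0, max_val]."""
    mse = float(np.mean((a - b) ** 2))
    if mse == 0.0:
        return float("inf")  # identical regions
    return 10.0 * np.log10(max_val ** 2 / mse)

# Hypothetical overlapping crops from two generated views of the same surface.
view_a = np.zeros((4, 4))
view_b = np.full((4, 4), 0.1)
print(psnr(view_a, view_b))  # -> 20.0, since MSE = 0.01 and 10 * log10(1 / 0.01) = 20
```

LPIPS would follow the same pattern with a learned perceptual distance in place of MSE.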
Circularity Check
No significant circularity; empirical fine-tuning procedure
Full rationale
The paper describes an empirical fine-tuning procedure applied to off-the-shelf Stable Diffusion, using conditioning schemes (camera pose, reference image) and training modifications to produce multi-view outputs. No mathematical derivation, first-principles prediction, fitted parameter renamed as prediction, or self-citation chain is present that reduces any claimed result to its own inputs by construction. All claims rest on external qualitative/quantitative evaluations of generated images, so the work is self-contained with no circular steps.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
We adopt a strategy of tiling six views surrounding the object into a single image... fixed absolute elevation angles and relative azimuth angles
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_high_calibrated_iff · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
switch from the scaled-linear schedule to the linear schedule for noise... v-prediction model
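The passage above names two concrete training changes: swapping Stable Diffusion's scaled-linear beta schedule for the plain linear one, and training with the v-prediction parameterization of Salimans & Ho. A numpy sketch of both — the schedule endpoints are the standard defaults, and the step count is illustrative:

```python
import numpy as np

T = 1000  # standard number of diffusion steps

# Stable Diffusion's default "scaled-linear" schedule vs the plain linear one.
betas_scaled_linear = np.linspace(0.00085 ** 0.5, 0.012 ** 0.5, T) ** 2
betas_linear = np.linspace(1e-4, 0.02, T)

# The linear schedule destroys more signal by the final step, which is the
# usual motivation for the switch: less low-frequency information leaks
# through at t = T, where the model should see (nearly) pure noise.
abar_scaled = np.cumprod(1.0 - betas_scaled_linear)
abar_linear = np.cumprod(1.0 - betas_linear)
assert abar_linear[-1] < abar_scaled[-1]

alpha_t = np.sqrt(abar_linear)        # signal coefficient per step
sigma_t = np.sqrt(1.0 - abar_linear)  # noise coefficient per step

def v_target(x0, eps, t):
    """v-prediction target (Salimans & Ho, 2022): v = alpha_t * eps - sigma_t * x0."""
    return alpha_t[t] * eps - sigma_t[t] * x0
```

In the diffusers library the same switch corresponds to setting the scheduler's `beta_schedule` to `"linear"` and `prediction_type` to `"v_prediction"`.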
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
-
Geometrically Consistent Multi-View Scene Generation from Freehand Sketches
A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...
-
Novel View Synthesis as Video Completion
Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
-
GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction
GeoQuery replaces corrupted rendering features with geometry-aligned proxy queries and restricts cross-view attention to local windows, enabling robust diffusion-based refinement under extreme view sparsity.
-
PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World
PhysForge generates physics-grounded 3D assets via a VLM-planned Hierarchical Physical Blueprint and a KineVoxel Injection diffusion model, backed by the new PhysDB dataset of 150,000 annotated assets.
-
Stylistic Attribute Control in Latent Diffusion Models
A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.
-
Sparse-View 3D Gaussian Splatting in the Wild
A new sparse-view 3D Gaussian splatting method for unconstrained scenes with distractors combines diffusion-based reference-guided refinement and sparsity-aware Gaussian replication to achieve better rendering quality.
-
Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens
Viewpoint tokens learned on a mixed 3D-rendered and photorealistic dataset enable precise camera control in text-to-image generation while factorizing geometry from appearance and transferring to unseen object categories.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.
-
Any3DAvatar: Fast and High-Quality Full-Head 3D Avatar Reconstruction from Single Portrait Image
Any3DAvatar reconstructs full-head 3D Gaussian avatars from one image via one-step denoising on a Plücker-aware scaffold plus auxiliary view supervision, beating prior single-image methods on fidelity while running su...
-
SIC3D: Style Image Conditioned Text-to-3D Gaussian Splatting Generation
SIC3D generates text-to-3D objects with Gaussian splatting then stylizes them using Variational Stylized Score Distillation loss plus scaling regularization to improve style match and geometry fidelity.
-
Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning
Realiz3D decouples visual domain from 3D controls in diffusion models via domain-aware residual adapters to enable photorealistic controllable generation.
-
TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models
TripoSG generates high-fidelity 3D meshes from input images via a large-scale rectified flow transformer and hybrid-trained 3D VAE on a custom 2-million-sample dataset, claiming state-of-the-art fidelity and generalization.
-
CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation
CamCo equips image-to-video generators with Plücker-coordinate camera inputs and epipolar attention to improve 3D consistency and camera controllability.
-
Evaluating Real-World Robot Manipulation Policies in Simulation
SIMPLER simulated environments yield policy performance that correlates strongly with real-world robot manipulation results and captures similar sensitivity to distribution shifts.
-
InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models
InstantMesh produces diverse, high-quality 3D meshes from single images in seconds by combining a multi-view diffusion model with a sparse-view large reconstruction model and optimizing directly on meshes.
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.
-
AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation
AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantica...
-
Hunyuan3D 2.5: Towards High-Fidelity 3D Assets Generation with Ultimate Details
Hunyuan3D 2.5's LATTICE model with 10B parameters generates detailed 3D shapes from images and uses multi-view PBR for textures, outperforming prior methods in fidelity and mesh quality.
Reference graph
Works this paper leans on
-
[1]
ShapeNet: An Information-Rich 3D Model Repository
Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
-
[2]
On the Importance of Noise Scheduling for Diffusion Models
Ting Chen. On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972, 2023.
-
[3]
Objaverse-XL: A Universe of 10M+ 3D Objects
Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-XL: A universe of 10M+ 3D objects. arXiv preprint arXiv:2307.05663, 2023.
-
[4]
Objaverse: A universe of annotated 3d objects
Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3D objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.
-
[5]
Efficient Diffusion Training via Min-SNR Weighting Strategy
Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining Guo. Efficient diffusion training via min-SNR weighting strategy. arXiv preprint arXiv:2303.09556, 2023.
-
[6]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
-
[7]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-
[8]
Segment Anything
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
-
[9]
Stable diffusion image variations
Lambda Labs. Stable diffusion image variations. https://huggingface.co/lambdalabs/sd-image-variations-diffusers, 2022.
-
[10]
Consistent123: One image to highly consistent 3d asset using case-aware diffusion priors
Yukang Lin, Haonan Han, Chaoqun Gong, Zunnan Xu, Yachao Zhang, and Xiu Li. Consistent123: One image to highly consistent 3D asset using case-aware diffusion priors. arXiv preprint arXiv:2309.17261, 2023.
-
[11]
One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization
Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Zexiang Xu, Hao Su, et al. One-2-3-45: Any single image to 3D mesh in 45 seconds without per-shape optimization. arXiv preprint arXiv:2306.16928, 2023.
-
[12]
Zero-1-to-3: Zero-shot one image to 3d object
Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3D object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023.
-
[13]
SyncDreamer: Generating Multiview-consistent Images from a Single-view Image
Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. SyncDreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023.
-
[14]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
-
[15]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
-
[16]
DreamFusion: Text-to-3D using 2D Diffusion
Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
-
[17]
Learning Transferable Visual Models from Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
-
[18]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.
-
[19]
Progressive Distillation for Fast Sampling of Diffusion Models
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
-
[20]
MVDream: Multi-view Diffusion for 3D Generation
Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3D generation. arXiv preprint arXiv:2308.16512, 2023.
-
[21]
Flexdiffuse: An adaptation of stable diffusion with image guidance
Tim Speed. Flexdiffuse: An adaptation of stable diffusion with image guidance. https://github.com/tim-speed/flexdiffuse, 2022.
-
[22]
DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation
Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. DreamGaussian: Generative Gaussian splatting for efficient 3D content creation. arXiv preprint arXiv:2309.16653, 2023.
-
[23]
ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation
Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. arXiv preprint arXiv:2305.16213, 2023.
-
[24]
Reference-only control
Lvmin Zhang. Reference-only control. https://github.com/Mikubill/sd-webui-controlnet/discussions/1236, 2023.
-
[25]
Adding conditional control to text-to-image diffusion models
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
-
[26]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.