pith. machine review for the scientific record.

arxiv: 2310.15110 · v1 · submitted 2023-10-23 · 💻 cs.CV · cs.GR

Recognition: 2 theorem links


Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model

Authors on Pith no claims yet

Pith reviewed 2026-05-16 22:02 UTC · model grok-4.3

classification 💻 cs.CV · cs.GR

keywords diffusion models · multi-view generation · single image to 3D · view consistency · Stable Diffusion · image conditioning · generative priors · ControlNet

The pith

Targeted conditioning on Stable Diffusion lets a model turn one image into geometrically consistent multi-view outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Zero123++ as an image-conditioned diffusion model that produces multiple views of the same 3D object from a single input photo. It achieves this by layering specific conditioning signals and training adjustments onto existing models such as Stable Diffusion, so that only modest finetuning is required. The goal is to eliminate the texture loss and view-to-view misalignment that usually appear when diffusion models are asked to synthesize additional angles. A reader would care because reliable multi-view output from one photo removes a major bottleneck in turning casual pictures into usable 3D assets for rendering, simulation, or reconstruction pipelines.
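
The released code (linked from the abstract below) packages the model as a Hugging Face diffusers pipeline. As a rough orientation, here is a minimal usage sketch; the checkpoint id, the custom-pipeline name, and the "trailing" timestep-spacing setting are assumptions drawn from the repository's usage notes rather than details verified in this review.

```python
# Minimal sketch of generating a multi-view grid from one image with a
# Zero123++-style diffusers pipeline. The checkpoint id and custom pipeline
# name below are assumptions (taken from the released repository), not
# values confirmed by this review.
import torch
from PIL import Image
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler

pipeline = DiffusionPipeline.from_pretrained(
    "sudo-ai/zero123plus-v1.1",                      # assumed checkpoint id
    custom_pipeline="sudo-ai/zero123plus-pipeline",  # assumed custom pipeline
    torch_dtype=torch.float16,
)
# The paper adjusts the noise schedule; "trailing" timestep spacing is the
# setting suggested in the repository and is kept here as an assumption.
pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(
    pipeline.scheduler.config, timestep_spacing="trailing"
)
pipeline.to("cuda")

cond = Image.open("input.png").convert("RGB")    # single input view
result = pipeline(cond, num_inference_steps=75)  # fixed set of novel views, tiled into one image
result.images[0].save("multiview_grid.png")
```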

Core claim

Zero123++ is an image-conditioned diffusion model for generating 3D-consistent multi-view images from a single input view. Various conditioning and training schemes are developed to minimize the effort of finetuning from off-the-shelf image diffusion models such as Stable Diffusion. The resulting model produces high-quality outputs that avoid texture degradation and geometric misalignment, and it can serve as the base for a ControlNet that adds further user control over the generated views.

What carries the argument

Conditioning and training schemes that adapt a pretrained 2D diffusion model to enforce cross-view geometric and texture consistency.
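
One plausible mechanism behind the image conditioning, consistent with the reference-only control technique the paper cites, is to let self-attention in the denoising UNet also attend to keys and values computed from the input view's features. The sketch below illustrates that idea in isolation; the module shape, naming, and single-layer framing are assumptions for exposition, not the paper's actual implementation.

```python
# Illustrative sketch of reference-image conditioning via appended attention
# keys/values ("reference attention"). Hypothetical module, not the paper's code.
import torch
import torch.nn.functional as F
from torch import nn

class ReferenceSelfAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # x:   (batch, tokens, dim)      features of the views being denoised
        # ref: (batch, ref_tokens, dim)  features extracted from the input view
        b, n, d = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        _, rk, rv = self.to_qkv(ref).chunk(3, dim=-1)
        # Append the reference keys/values so every generated token can also
        # attend to the conditioning image, which is what ties texture and
        # identity across the synthesized views.
        k = torch.cat([k, rk], dim=1)
        v = torch.cat([v, rv], dim=1)

        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(b, -1, self.heads, d // self.heads).transpose(1, 2)

        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)
```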

If this is right

  • Multi-view images can be produced at higher quality than prior single-image methods without heavy retraining.
  • Geometric consistency across views improves suitability for downstream 3D reconstruction tasks.
  • A ControlNet can be trained on top of the model to add explicit control over pose or style.
  • The same minimal-finetuning recipe can be applied to other pretrained 2D diffusion bases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may transfer to newer diffusion backbones, yielding even stronger priors with the same conditioning overhead.
  • Consistent multi-view sets could directly feed into neural radiance field or Gaussian splatting pipelines with less post-processing.
  • Limits of the consistency might appear on object categories with complex transparency or thin structures not emphasized in the examples.
  • If the conditioning generalizes, it offers a low-cost route to multi-view data augmentation for other 3D learning problems.

Load-bearing premise

The chosen conditioning signals and training steps applied to Stable Diffusion will keep producing aligned geometry and intact textures across all new inputs without fresh failure modes.

What would settle it

A test set of single-input objects where the generated multi-view images display clear geometric misalignment or texture corruption on viewpoints the model never saw during its reported training.
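
A minimal sketch of such a check follows, assuming ground-truth renders of held-out objects exist at the model's target poses and are paired file-by-file with the generated views; the directory layout and the use of PSNR and LPIPS as consistency proxies are assumptions, not the paper's own evaluation protocol.

```python
# Minimal sketch: score generated views against ground-truth renders at the
# same target poses. Paths, pairing convention, and metric choice (PSNR + LPIPS)
# are assumptions for illustration.
import glob
import numpy as np
import torch
import lpips                       # pip install lpips
from PIL import Image

loss_fn = lpips.LPIPS(net="alex")  # perceptual distance, lower is better

def to_tensor(path: str) -> torch.Tensor:
    """Load an image as a float tensor in [-1, 1] with shape (1, 3, H, W)."""
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0
    return torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0) * 2.0 - 1.0

def psnr(a: torch.Tensor, b: torch.Tensor) -> float:
    """PSNR for images in [-1, 1], so the peak-to-peak range is 2."""
    mse = torch.mean((a - b) ** 2).item()
    return 10.0 * float(np.log10(4.0 / mse))

scores = []
for gen_path in sorted(glob.glob("generated/view_*.png")):      # assumed layout
    gt_path = gen_path.replace("generated/", "ground_truth/")   # assumed pairing
    gen, gt = to_tensor(gen_path), to_tensor(gt_path)            # matching resolutions assumed
    with torch.no_grad():
        scores.append((psnr(gen, gt), loss_fn(gen, gt).item()))

mean_psnr, mean_lpips = np.mean(scores, axis=0)
print(f"mean PSNR {mean_psnr:.2f} dB, mean LPIPS {mean_lpips:.4f}")
```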

Original abstract

We report Zero123++, an image-conditioned diffusion model for generating 3D-consistent multi-view images from a single input view. To take full advantage of pretrained 2D generative priors, we develop various conditioning and training schemes to minimize the effort of finetuning from off-the-shelf image diffusion models such as Stable Diffusion. Zero123++ excels in producing high-quality, consistent multi-view images from a single image, overcoming common issues like texture degradation and geometric misalignment. Furthermore, we showcase the feasibility of training a ControlNet on Zero123++ for enhanced control over the generation process. The code is available at https://github.com/SUDO-AI-3D/zero123plus.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Zero123++, an image-conditioned diffusion model fine-tuned from Stable Diffusion for generating 3D-consistent multi-view images from a single input view. It develops conditioning schemes (camera pose and reference image) and training procedures to leverage pretrained 2D priors with minimal fine-tuning effort, reports qualitative improvements in avoiding texture degradation and geometric misalignment, and demonstrates training a ControlNet on the model for additional control. Code is released at the provided GitHub link.

Significance. If the consistency claims are substantiated, the work provides a practical base model that efficiently extends 2D diffusion priors to multi-view generation. This could accelerate progress in novel view synthesis and 3D reconstruction pipelines, with the ControlNet extension and open code release adding immediate utility for the community.

major comments (3)
  1. [Abstract and Experiments] Abstract and Experiments section: The central claims of superior consistency and overcoming texture degradation/geometric misalignment rest entirely on qualitative examples. No quantitative metrics (e.g., cross-view PSNR, LPIPS, or consistency scores), ablation studies on the conditioning schemes, or error analysis are provided, leaving the strength of the improvements unverified.
  2. [Method] Method section: The conditioning on camera poses and reference images is presented as inducing 3D consistency, yet the model remains a 2D diffusion process without explicit 3D supervision, epipolar constraints, or volumetric losses. This raises the risk that observed consistency is limited to 2D appearance correlation rather than true geometric fidelity, especially outside the reported examples.
  3. [ControlNet section] ControlNet extension: The claim that a ControlNet can be trained on Zero123++ for enhanced control is stated without details on training data, hyperparameters, or comparative quantitative/qualitative results demonstrating the added value over direct use of Zero123++.
minor comments (2)
  1. [Abstract] The abstract would benefit from briefly stating the fine-tuning dataset and number of views generated per input.
  2. [Figures] Figure captions in the results section should explicitly label the input view and specify the camera poses for each generated output to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each of the major comments below and have updated the paper to incorporate the suggested improvements where feasible.

Point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and Experiments section: The central claims of superior consistency and overcoming texture degradation/geometric misalignment rest entirely on qualitative examples. No quantitative metrics (e.g., cross-view PSNR, LPIPS, or consistency scores), ablation studies on the conditioning schemes, or error analysis are provided, leaving the strength of the improvements unverified.

    Authors: We appreciate this feedback. The initial version of the paper indeed relied on qualitative examples to illustrate the improvements in consistency and quality. To address this, we have conducted additional quantitative evaluations, including cross-view PSNR, LPIPS, and consistency scores, as well as ablation studies on the conditioning schemes. These results have been added to the revised Experiments section, providing stronger verification of our claims. revision: yes

  2. Referee: [Method] Method section: The conditioning on camera poses and reference images is presented as inducing 3D consistency, yet the model remains a 2D diffusion process without explicit 3D supervision, epipolar constraints, or volumetric losses. This raises the risk that observed consistency is limited to 2D appearance correlation rather than true geometric fidelity, especially outside the reported examples.

    Authors: We agree that Zero123++ operates as a 2D diffusion model without explicit 3D supervision or geometric constraints. The consistency is induced through the specific conditioning on camera poses and reference images, combined with training on multi-view data that encourages the model to maintain coherence across views. While this does not guarantee perfect geometric fidelity in all cases, our qualitative results demonstrate practical improvements over baselines. We have added a clarification in the Method section discussing the implicit mechanism and its scope. revision: partial

  3. Referee: [ControlNet section] ControlNet extension: The claim that a ControlNet can be trained on Zero123++ for enhanced control is stated without details on training data, hyperparameters, or comparative quantitative/qualitative results demonstrating the added value over direct use of Zero123++.

    Authors: We thank the referee for noting this omission. The original manuscript provided only a high-level description of the ControlNet extension. In the revision, we have included detailed information on the training data, hyperparameters, and additional comparative results (both quantitative and qualitative) to demonstrate the added value of the ControlNet. revision: yes
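
For readers who want a concrete picture of the ControlNet extension discussed in this exchange, the following sketch shows how a control branch could be initialized from a Stable-Diffusion-style base UNet with diffusers and fine-tuned while the base stays frozen. The checkpoint id and subfolder layout are assumptions; this is not the authors' training code.

```python
# Illustrative sketch of attaching a ControlNet branch to a Stable-Diffusion-style
# base UNet. The checkpoint id and "unet" subfolder are assumptions; this is not
# the authors' actual training setup.
import torch
from diffusers import UNet2DConditionModel, ControlNetModel

base_unet = UNet2DConditionModel.from_pretrained(
    "sudo-ai/zero123plus-v1.1", subfolder="unet"   # assumed checkpoint layout
)

# from_unet copies the UNet encoder weights, so the control branch starts from
# the base model's priors instead of from scratch.
controlnet = ControlNetModel.from_unet(base_unet)

# A typical recipe freezes the base UNet and optimizes only the control branch.
base_unet.requires_grad_(False)
optimizer = torch.optim.AdamW(controlnet.parameters(), lr=1e-5)
```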

Circularity Check

0 steps flagged

No significant circularity; empirical fine-tuning procedure

Full rationale

The paper describes an empirical fine-tuning procedure applied to off-the-shelf Stable Diffusion, using conditioning schemes (camera pose, reference image) and training modifications to produce multi-view outputs. No mathematical derivation, first-principles prediction, fitted parameter renamed as prediction, or self-citation chain is present that reduces any claimed result to its own inputs by construction. All claims rest on external qualitative/quantitative evaluations of generated images, so the work is self-contained with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that pretrained 2D diffusion priors plus the described conditioning suffice for 3D consistency; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5433 in / 1011 out tokens · 88820 ms · 2026-05-16T22:02:36.780458+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 7.0

    R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.

  2. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

  3. Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

    cs.CV 2026-04 unverdicted novelty 7.0

    A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...

  4. Novel View Synthesis as Video Completion

    cs.CV 2026-04 unverdicted novelty 7.0

    Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.

  5. GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    GeoQuery replaces corrupted rendering features with geometry-aligned proxy queries and restricts cross-view attention to local windows, enabling robust diffusion-based refinement under extreme view sparsity.

  6. PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World

    cs.CV 2026-05 unverdicted novelty 6.0

    PhysForge generates physics-grounded 3D assets via a VLM-planned Hierarchical Physical Blueprint and a KineVoxel Injection diffusion model, backed by the new PhysDB dataset of 150,000 annotated assets.

  7. Stylistic Attribute Control in Latent Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.

  8. Sparse-View 3D Gaussian Splatting in the Wild

    cs.CV 2026-04 unverdicted novelty 6.0

    A new sparse-view 3D Gaussian splatting method for unconstrained scenes with distractors combines diffusion-based reference-guided refinement and sparsity-aware Gaussian replication to achieve better rendering quality.

  9. Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens

    cs.CV 2026-04 unverdicted novelty 6.0

    Viewpoint tokens learned on a mixed 3D-rendered and photorealistic dataset enable precise camera control in text-to-image generation while factorizing geometry from appearance and transferring to unseen object categories.

  10. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 6.0

    UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.

  11. Any3DAvatar: Fast and High-Quality Full-Head 3D Avatar Reconstruction from Single Portrait Image

    cs.CV 2026-04 unverdicted novelty 6.0

    Any3DAvatar reconstructs full-head 3D Gaussian avatars from one image via one-step denoising on a Plücker-aware scaffold plus auxiliary view supervision, beating prior single-image methods on fidelity while running su...

  12. SIC3D: Style Image Conditioned Text-to-3D Gaussian Splatting Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    SIC3D generates text-to-3D objects with Gaussian splatting then stylizes them using Variational Stylized Score Distillation loss plus scaling regularization to improve style match and geometry fidelity.

  13. Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning

    cs.GR 2026-03 conditional novelty 6.0

    Realiz3D decouples visual domain from 3D controls in diffusion models via domain-aware residual adapters to enable photorealistic controllable generation.

  14. TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models

    cs.CV 2025-02 unverdicted novelty 6.0

    TripoSG generates high-fidelity 3D meshes from input images via a large-scale rectified flow transformer and hybrid-trained 3D VAE on a custom 2-million-sample dataset, claiming state-of-the-art fidelity and generalization.

  15. CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

    cs.CV 2024-06 unverdicted novelty 6.0

    CamCo equips image-to-video generators with Plücker-coordinate camera inputs and epipolar attention to improve 3D consistency and camera controllability.

  16. Evaluating Real-World Robot Manipulation Policies in Simulation

    cs.RO 2024-05 conditional novelty 6.0

    SIMPLER simulated environments yield policy performance that correlates strongly with real-world robot manipulation results and captures similar sensitivity to distribution shifts.

  17. InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

    cs.CV 2024-04 unverdicted novelty 6.0

    InstantMesh produces diverse, high-quality 3D meshes from single images in seconds by combining a multi-view diffusion model with a sparse-view large reconstruction model and optimizing directly on meshes.

  18. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 5.0

    R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.

  19. AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation

    cs.CV 2026-04 unverdicted novelty 4.0

    AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantica...

  20. Hunyuan3D 2.5: Towards High-Fidelity 3D Assets Generation with Ultimate Details

    cs.CV 2025-06 unverdicted novelty 4.0

    Hunyuan3D 2.5's LATTICE model with 10B parameters generates detailed 3D shapes from images and uses multi-view PBR for textures, outperforming prior methods in fidelity and mesh quality.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 18 Pith papers · 11 internal anchors

  1. [1]

    ShapeNet: An Information-Rich 3D Model Repository

Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.

  2. [2]

On the importance of noise scheduling for diffusion models

    Ting Chen. On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972, 2023.

  3. [3]

Objaverse-xl: A universe of 10m+ 3d objects

Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint arXiv:2307.05663, 2023.

  4. [4]

    Objaverse: A universe of annotated 3d objects

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.

  5. [5]

Efficient diffusion training via min-snr weighting strategy

Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining Guo. Efficient diffusion training via min-snr weighting strategy. arXiv preprint arXiv:2303.09556, 2023.

  6. [6]

    LoRA: Low-Rank Adaptation of Large Language Models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

  7. [7]

    Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

  8. [8]

    Segment Anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.

  9. [9]

    Stable diffusion image variations

Lambda Labs. Stable diffusion image variations. https://huggingface.co/lambdalabs/sd-image-variations-diffusers, 2022.

  10. [10]

    Consistent123: One image to highly consistent 3d asset using case-aware diffusion priors

Yukang Lin, Haonan Han, Chaoqun Gong, Zunnan Xu, Yachao Zhang, and Xiu Li. Consistent123: One image to highly consistent 3d asset using case-aware diffusion priors. arXiv preprint arXiv:2309.17261, 2023.

  11. [11]

One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization

Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Zexiang Xu, Hao Su, et al. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. arXiv preprint arXiv:2306.16928, 2023.

  12. [12]

    Zero-1-to-3: Zero-shot one image to 3d object

Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023.

  13. [13]

    SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023.

  14. [14]

    Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  15. [15]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.

  16. [16]

    DreamFusion: Text-to-3D using 2D Diffusion

Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.

  17. [17]

    Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

  18. [18]

    High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.

  19. [19]

    Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.

  20. [20]

    MVDream: Multi-view Diffusion for 3D Generation

Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.

  21. [21]

    Flexdiffuse: An adaptation of stable diffusion with image guidance

Tim Speed. Flexdiffuse: An adaptation of stable diffusion with image guidance. https://github.com/tim-speed/flexdiffuse, 2022.

  22. [22]

    DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023.

  23. [23]

Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation

Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. arXiv preprint arXiv:2305.16213, 2023.

  24. [24]

    Reference-only control

Lyumin Zhang. Reference-only control. https://github.com/Mikubill/sd-webui-controlnet/discussions/1236, 2023.

  25. [25]

    Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.

  26. [26]

    The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.