CoFi-UCGen: Coarse-to-Fine Unsupervised Conditional Generation without Label Priors

Ce Zheng; Jingyuan Xia; Mai Xu; Shengxi Li; Si Liu; Zhaokun Hu

arxiv: 2606.05652 · v1 · pith:7EGESOQLnew · submitted 2026-06-04 · 💻 cs.CV

Pith reviewed 2026-06-28 02:20 UTC · model grok-4.3

classification 💻 cs.CV

keywords unsupervised conditional generation · coarse-to-fine framework · bit-codes · diffusion models · semantic disentanglement · label-free image synthesis · adversarial learning · hierarchical modulation

The pith

CoFi-UCGen disentangles global and fine-grained semantics to enable label-free conditional image generation at both coarse and fine levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a coarse-to-fine framework for unsupervised conditional image generation that works without any annotated labels. It uses an adversarial semantic reciprocal learning approach to create consistent mappings between images and a latent space structured by bit-codes for global semantics. These bit-codes then support a hierarchical modulation in diffusion models to progressively control finer details during image synthesis. A sympathetic reader would care because this removes the dependency on expensive labeling while still allowing precise control over generated content at different scales. If correct, it would advance generative models toward more autonomous and flexible operation on raw data.

Core claim

The central claim is that an adversarial semantic reciprocal learning theory can ensure semantic consistency and completeness between images and latent spaces, permitting bit-codes to encode distinct global semantics independently of noise sampling, which in turn supports building a fine-grained semantic basis and hierarchical modulation in diffusion models for layer-wise control of fine attributes.

What carries the argument

Bit-codes that structure a coarse-grained latent space, combined with a hierarchical modulation mechanism in diffusion models.
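
A minimal sketch of how these two pieces could fit together, written in Python with NumPy. Everything here is an illustrative assumption rather than the paper's implementation: the function names (`all_bit_codes`, `modulate`, `hierarchical_forward`), the FiLM-style scale-and-shift form of the modulation, and the `tanh` stand-in for a network layer are all hypothetical, since the abstract does not specify how bit-codes are injected.

```python
import numpy as np


def all_bit_codes(num_bits):
    """Enumerate every code in {0, 1}^num_bits, one code per row."""
    idx = np.arange(2 ** num_bits)
    shifts = np.arange(num_bits)[::-1]          # most significant bit first
    return (idx[:, None] >> shifts) & 1


def modulate(features, code, w_scale, w_shift):
    """Assumed FiLM-style modulation: features * (1 + gamma) + beta."""
    signed = 2.0 * np.asarray(code) - 1.0       # map {0, 1} to {-1, +1}
    gamma = signed @ w_scale                    # per-channel scale offset
    beta = signed @ w_shift                     # per-channel shift
    return features * (1.0 + gamma) + beta


def hierarchical_forward(h, code, layer_weights):
    """Inject the same code at every layer, from first to last."""
    for w_scale, w_shift in layer_weights:
        # tanh is a placeholder for a real network layer (assumption)
        h = np.tanh(modulate(h, code, w_scale, w_shift))
    return h


rng = np.random.default_rng(0)
num_bits, channels, num_layers = 3, 8, 4
layer_weights = [
    (0.5 * rng.standard_normal((num_bits, channels)),
     0.5 * rng.standard_normal((num_bits, channels)))
    for _ in range(num_layers)
]
noise_features = rng.standard_normal(channels)  # sampled independently of the code
codes = all_bit_codes(num_bits)
outputs = np.array(
    [hierarchical_forward(noise_features, c, layer_weights) for c in codes]
)
```

Under this reading, the same noise-derived features pass through every layer, and only the bit-code decides how each layer rescales and shifts them.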

If this is right

Achieves both coarse- and fine-grained conditional generation without labels.
Consistently outperforms prior UCGen methods on image quality, semantic consistency, and control accuracy.
Maintains independent noise sampling while capturing global semantics.
Enables layer-wise injection from coarse conditions to fine attributes in diffusion models.
Works without pre-trained feature extractors or label priors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method might generalize to video or 3D generation tasks where semantic hierarchies are important.
It suggests that explicit decomposition of semantics can improve controllability even in unsupervised settings.
Applications could include more accessible creative tools that do not require annotated training data.
Further work could explore the scalability of bit-codes to higher numbers of semantic classes.

Load-bearing premise

The adversarial semantic reciprocal learning theory ensures semantic consistency and completeness between images and latent spaces so that bit-codes can capture distinct global semantics independently of noise.

What would settle it

Running the generation process with fixed noise but varied bit-codes and checking if the resulting images exhibit clearly distinct global semantic attributes without any label supervision during training.
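
One way to make this check quantitative is sketched below in Python with NumPy, under loud assumptions: `generator(code, noise)` is a hypothetical interface returning one feature vector per sample, `toy_generator` is a stand-in with no relation to the paper's model, and output variance is only a crude proxy for "distinct global semantic attributes", which would still need semantic inspection.

```python
import numpy as np


def code_sensitivity(generator, codes, noises):
    """Compare output variation across codes with variation across noise.

    generator(code, noise) -> 1-D array (hypothetical interface).
    Returns (variance across codes at fixed noise,
             variance across noise at fixed code), each averaged.
    """
    outputs = np.array([[generator(c, z) for z in noises] for c in codes])
    # outputs has shape (num_codes, num_noises, dim)
    across_codes = outputs.var(axis=0).mean()   # noise fixed, code varied
    across_noise = outputs.var(axis=1).mean()   # code fixed, noise varied
    return across_codes, across_noise


def toy_generator(code, noise):
    """Stand-in generator: the code sets the bulk of the output, noise perturbs it."""
    return 3.0 * np.repeat(code, 2) + 0.1 * noise


rng = np.random.default_rng(0)
codes = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
noises = rng.standard_normal((16, 4))
across_codes, across_noise = code_sensitivity(toy_generator, codes, noises)
print(across_codes, across_noise)
```

If variation across codes did not clearly exceed variation across noise for a trained model, that would be a warning sign that the bit-codes are not controlling much; a large gap would be consistent with the claim but would not by itself show that the controlled attributes are semantic.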

Figures

Figures reproduced from arXiv: 2606.05652 by Ce Zheng, Jingyuan Xia, Mai Xu, Shengxi Li, Si Liu, Zhaokun Hu.

**Figure 1.** Illustration of our CoFi-UCGen method. The top panel visualizes the coarse-grained semantic space, whereby images can be interpreted as coarse-grained clusters (e.g., red flowers). By anchoring on a specific coarse condition (highlighted in the green box), our model employs hierarchical modulation to guide the diffusion process. The bottom panel demonstrates the fine-grained controllability, in which the m…

**Figure 2.** Illustration of existing paradigms in UCGen, against …

**Figure 3.** Overview of the proposed CoFi-UCGen framework. At the …

**Figure 4.** Overview of the training phase of coarse-grained …

**Figure 5.** Illustration on the relationship between reciprocal …

**Figure 6.** Overview of the training phase of fine-grained framework in our CoFi-UCGen. Note that real images and the encoder …

**Figure 7.** Visualization of the semantic hierarchy in a U …

**Figure 8.** Evaluations on the synthetic dataset, which consists of …

**Figure 9.** Qualitative comparison of both coarse- and fine-grained UCGen on Stanford Cars (left two panels, …

**Figure 10.** Qualitative comparison of both coarse- and fine-grained UCGen on Oxford102-Flowers (left two panels, …

**Figure 11.** Latent space continuity under hierarchical diffusion …

Original abstract

Unsupervised conditional image generation (UCGen) aims to control generation without relying on manually annotated labels, yet remains challenging due to unstructured semantic representations across granularities. To address this, we propose a novel coarse-to-fine UCGen framework (CoFi-UCGen) that explicitly disentangles global semantics from fine-grained variations, which to the best of our knowledge, sets out the first successful attempt for both coarse- and fine-grained conditional generation without any labels. More specifically, we first propose the adversarial semantic reciprocal learning theory to ensure the semantic consistency and completeness between images and latent spaces. Based on the consistency, we propose the bit-codes to learn a structured coarse-grained latent space, and further prove distinct global semantics inherent from our bit-codes while preserving independent noise sampling for generation. Building upon these bit-codes, we establish a fine-grained semantic basis and introduce a hierarchical modulation mechanism in diffusion models, by enabling layer-wise injection from coarse conditions to progressively control fine-grained attributes during generation. Extensive experiments demonstrate that without any label priors or pre-trained feature extractors, our CoFi-UCGen consistently outperforms existing UCGen methods in terms of image quality, semantic consistency, and control accuracy, verifying the effectiveness of explicit coarse-to-fine semantic decomposition for the challenging UCGen task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper puts forward an explicit coarse-to-fine structure for label-free conditional generation via bit-codes and layer-wise modulation, but the claimed proof of distinct global semantics reduces to the behavior of the proposed losses without a separation argument.

The new piece is the bit-code construction for coarse semantics inside a diffusion model, paired with hierarchical modulation that injects those codes layer by layer to handle finer attributes. That decomposition is presented as the first explicit attempt at both granularities without labels or pretrained extractors. The experiments report gains in image quality, semantic consistency, and control accuracy over prior UCGen baselines, which is the concrete evidence offered.

The load-bearing step is the assertion that adversarial semantic reciprocal learning produces bit-codes that provably carry distinct global semantics while keeping noise sampling independent. No derivation appears that would force separation (mutual-information lower bounds, injectivity, or similar), so the distinctness claim rests on the empirical outcome of the losses rather than a guarantee. If the full paper only shows that the codes correlate with some unsupervised factors, the theoretical language overreaches what is demonstrated.

The rest of the architecture follows standard diffusion conditioning once the codes are in place, so the main technical risk sits in that first step. The citation pattern looks typical for the subfield; nothing obviously missing or inflated.

This is for researchers already working on unsupervised conditioning in generative models who want a concrete architecture to try or extend. A reader looking for formal disentanglement results will find the evidence thinner than the abstract suggests.

It is worth sending to referees. The problem is real, the proposal is specific, and the reported results give something to evaluate, even if the separation claim needs more scrutiny.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes CoFi-UCGen, a coarse-to-fine framework for unsupervised conditional image generation (UCGen) that disentangles global semantics from fine-grained variations without any labels or pre-trained extractors. It introduces an adversarial semantic reciprocal learning theory to enforce consistency and completeness between images and latent spaces, derives bit-codes to structure a coarse-grained latent space (claimed to prove distinct global semantics while allowing independent noise sampling), and uses these to drive a hierarchical modulation mechanism inside diffusion models for progressive fine-grained control. Experiments are said to show consistent outperformance over prior UCGen methods on image quality, semantic consistency, and control accuracy, positioning the work as the first successful attempt at both coarse- and fine-grained label-free conditional generation.

Significance. If the central construction and empirical claims hold, the result would be significant: it would supply the first explicit mechanism for multi-granularity control in UCGen without label priors, potentially enabling more structured generation in annotation-scarce domains. The combination of a reciprocal-learning theory, bit-code discretization, and diffusion-based hierarchical modulation is a coherent architectural response to the unstructured semantics problem highlighted in the abstract.

major comments (2)

[Abstract] the claim that bit-codes 'further prove distinct global semantics inherent from our bit-codes while preserving independent noise sampling' is load-bearing for the coarse-control contribution, yet the provided text supplies no derivation, theorem, or bound (e.g., mutual-information lower bound, injectivity argument, or separation guarantee) showing that the bit-code space must factor into distinct globals rather than merely correlating with some unsupervised factors. The 'proof' therefore reduces to the empirical behavior of the proposed losses, which is insufficient to support the stated guarantee.
[Abstract] the manuscript asserts both 'proofs' and consistent outperformance, but the abstract contains no equations, loss definitions, or experimental protocol details; without these, the central claims cannot be assessed for internal consistency or reproducibility from the given material.
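
For orientation, one familiar example of the kind of bound the first comment asks for is the variational mutual-information lower bound from InfoGAN, reproduced here as a reference point. It is a known result about a different model, not something taken from the CoFi-UCGen manuscript; the symbols $c$ (latent code), $z$ (independently sampled noise), $G$ (generator), and $Q$ (auxiliary posterior) are assumptions introduced for this illustration.

```latex
I\bigl(c;\, G(z, c)\bigr) \;\ge\; \mathbb{E}_{c \sim p(c),\; z \sim p(z)}\Bigl[\log Q\bigl(c \mid G(z, c)\bigr)\Bigr] + H(c)
```

A bound of this type only says that the code is recoverable from the generated image. On its own it does not guarantee that individual bits align with distinct global semantics, which is the gap the comment points at.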

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on the abstract. We respond to each major comment below.

Referee: [Abstract] the claim that bit-codes 'further prove distinct global semantics inherent from our bit-codes while preserving independent noise sampling' is load-bearing, yet the provided text supplies no derivation, theorem, or bound showing that the bit-code space must factor into distinct globals. The 'proof' therefore reduces to the empirical behavior of the proposed losses.

Authors: We agree the abstract itself contains no derivation, as is conventional due to length constraints. The full theoretical argument (including the mutual-information lower bound, injectivity argument, and separation guarantee derived from the adversarial semantic reciprocal learning theory) is given in Section 3.2 of the manuscript. We will revise the abstract to explicitly reference this section so readers know the guarantee is not merely empirical.
revision: yes

Referee: [Abstract] the manuscript asserts both 'proofs' and consistent outperformance, but the abstract contains no equations, loss definitions, or experimental protocol details; without these, the central claims cannot be assessed for internal consistency or reproducibility from the given material.

Authors: Abstracts are intentionally free of equations and protocol details to remain concise and accessible; all loss definitions, theoretical derivations, and experimental protocols appear in Sections 3 and 4. The outperformance claims are substantiated by the quantitative results and ablation studies reported in the main text. No change to the abstract format is required.
revision: no

Circularity Check

0 steps flagged

No significant circularity; derivation presented as novel proposal without reduction to inputs by construction

The provided abstract and description introduce a new 'adversarial semantic reciprocal learning theory' as a proposal, followed by bit-codes derived from it and a claim to 'further prove' properties. No equations, self-citations, fitted parameters renamed as predictions, or uniqueness theorems from prior author work are quoted that would reduce the central claims to the inputs by definition. The derivation chain is therefore self-contained as a sequence of novel constructions rather than tautological or load-bearing self-references.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review; no explicit free parameters, axioms, or invented entities detailed beyond high-level concepts like bit-codes.

invented entities (1)

bit-codes · no independent evidence
purpose: learn structured coarse-grained latent space with distinct global semantics
Introduced to ensure semantic consistency from adversarial reciprocal learning

pith-pipeline@v0.9.1-grok · 5772 in / 1096 out tokens · 40358 ms · 2026-06-28T02:20:59.621892+00:00 · methodology

Reference graph

Works this paper leans on

64 extracted references · 8 canonical work pages · 6 internal anchors

[1]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling, “Auto-encoding variational bayes,”arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[2]

Generative adversarial nets,

Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” inAdvances in Neural Inf. Process. Syst., 2014, vol. 27

2014
[3]

High-resolution image synthesis with latent diffu- sion models,

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer, “High-resolution image synthesis with latent diffu- sion models,” inProc. IEEE Conf. Comp. Vis. Patt. Recogn., 2022, pp. 10684–10695

2022
[4]

Ganspace: Discovering interpretable gan controls,

Erik H ¨ark¨onen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris, “Ganspace: Discovering interpretable gan controls,” inAdvances in Neural Inf. Process. Syst., 2020, vol. 33, pp. 9841–9850

2020
[5]

Diffusion models already have a semantic latent space,

Mingi Kwon, Jaeseok Jeong, and Youngjung Uh, “Diffusion models already have a semantic latent space,” inProc. Int. Conf. Learn. Representations, 2023

2023
[6]

Multi-label condi- tional generation from pre-trained models,

Magdalena Proszewska, Maciej Wołczyk, Maciej Zieba, Patryk Wielopolski, Łukasz Maziarka, and Marek ´Smieja, “Multi-label condi- tional generation from pre-trained models,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 9, pp. 6185–6198, 2024

2024
[7]

Conditional Generative Adversarial Nets

Mehdi Mirza and Simon Osindero, “Conditional generative adversarial nets,”arXiv preprint arXiv:1411.1784, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[8]

Conditional image synthesis with auxiliary classifier gans,

Augustus Odena, Christopher Olah, and Jonathon Shlens, “Conditional image synthesis with auxiliary classifier gans,” inProc. Int. Conf. Mach. Learn.PMLR, 2017, pp. 2642–2651

2017
[9]

Diffusion models beat gans on image synthesis,

Prafulla Dhariwal and Alexander Nichol, “Diffusion models beat gans on image synthesis,” inAdvances in Neural Inf. Process. Syst., 2021, vol. 34, pp. 8780–8794

2021
[10]

Neural characteristic function learning for conditional image generation,

Shengxi Li, Jialu Zhang, Yifei Li, Mai Xu, Xin Deng, and Li Li, “Neural characteristic function learning for conditional image generation,” in Proc. IEEE Int. Conf. Comp. Vis., 2023, pp. 7204–7214

2023
[11]

Generative attribute controller with conditional filtered generative adversarial net- works,

Takuhiro Kaneko, Kaoru Hiramatsu, and Kunio Kashino, “Generative attribute controller with conditional filtered generative adversarial net- works,” inProc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017, pp. 6089– 6098

2017
[12]

Attribute-guided face generation using conditional cyclegan,

Yongyi Lu, Yu-Wing Tai, and Chi-Keung Tang, “Attribute-guided face generation using conditional cyclegan,” inProc. Eur. Conf. Comp. Vis., 2018, pp. 282–297

2018
[13]

Conditional gans with auxiliary discriminative classifier,

Liang Hou, Qi Cao, Huawei Shen, Siyuan Pan, Xiaoshuang Li, and Xueqi Cheng, “Conditional gans with auxiliary discriminative classifier,” inProc. Int. Conf. Mach. Learn.PMLR, 2022, pp. 8888–8902

2022
[14]

TAC-GAN - Text Conditioned Auxiliary Classifier Generative Adversarial Network

Ayushman Dash, John Cristian Borges Gamboa, Sheraz Ahmed, Marcus Liwicki, and Muhammad Zeshan Afzal, “Tac-gan-text conditioned auxiliary classifier generative adversarial network,”arXiv preprint arXiv:1703.06412, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[15]

Text to image generation with semantic-spatial aware gan,

Wentong Liao, Kai Hu, Michael Ying Yang, and Bodo Rosenhahn, “Text to image generation with semantic-spatial aware gan,” inProc. IEEE Conf. Comp. Vis. Patt. Recogn., 2022, pp. 18187–18196

2022
[16]

Motiondiffuse: Text-driven human motion generation with diffusion model,

Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu, “Motiondiffuse: Text-driven human motion generation with diffusion model,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 6, pp. 4115–4128, 2024

2024
[17]

Drb-gan: A dynamic resblock generative adversarial network for artistic style transfer,

Wenju Xu, Chengjiang Long, Ruisheng Wang, and Guanghui Wang, “Drb-gan: A dynamic resblock generative adversarial network for artistic style transfer,” inProc. IEEE Int. Conf. Comp. Vis., 2021, pp. 6383– 6392

2021
[18]

Stylediffusion: Controllable disentangled style transfer via diffusion models,

Zhizhong Wang, Lei Zhao, and Wei Xing, “Stylediffusion: Controllable disentangled style transfer via diffusion models,” inProc. IEEE Int. Conf. Comp. Vis., 2023, pp. 7677–7689

2023
[19]

Sketchygan: Towards diverse and realistic sketch to image synthesis,

Wengling Chen and James Hays, “Sketchygan: Towards diverse and realistic sketch to image synthesis,” inProc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018, pp. 9416–9425

2018
[20]

Image generation from sketch constraint using contextual gan,

Yongyi Lu, Shangzhe Wu, Yu-Wing Tai, and Chi-Keung Tang, “Image generation from sketch constraint using contextual gan,” inProc. Eur. Conf. Comp. Vis., 2018, pp. 205–220

2018
[21]

Conditional image generation with pixelcnn decoders,

Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al., “Conditional image generation with pixelcnn decoders,” inAdvances in Neural Inf. Process. Syst., 2016, vol. 29

2016
[23]

At- tribute2image: Conditional image generation from visual attributes,

Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee, “At- tribute2image: Conditional image generation from visual attributes,” in Proc. Eur. Conf. Comp. Vis.Springer, 2016, pp. 776–791

2016
[24]

A simple framework for contrastive learning of visual representations,

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton, “A simple framework for contrastive learning of visual representations,” inProc. Int. Conf. Mach. Learn.PmLR, 2020, pp. 1597–1607

2020
[25]

Emerging properties in self-supervised vision transformers,

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J ´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin, “Emerging properties in self-supervised vision transformers,” inProc. IEEE Int. Conf. Comp. Vis., 2021, pp. 9650–9660

2021
[26]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al., “Dinov2: Learning robust visual features without supervision,”arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Warpedganspace: Finding non-linear rbf paths in gan latent space,

Christos Tzelepis, Georgios Tzimiropoulos, and Ioannis Patras, “Warpedganspace: Finding non-linear rbf paths in gan latent space,” inProc. IEEE Int. Conf. Comp. Vis., 2021, pp. 6393–6402

2021
[28]

Clustergan: Latent space clustering in generative adversarial networks,

Sudipto Mukherjee, Himanshu Asnani, Eugene Lin, and Sreeram Kan- nan, “Clustergan: Latent space clustering in generative adversarial networks,” inProc. Conf. AAAI, 2019, pp. 4610–4617

2019
[29]

Diverse Image Generation via Self-Conditioned GANs,

Steven Liu, Tongzhou Wang, David Bau, Jun-Yan Zhu, and Antonio Torralba, “Diverse Image Generation via Self-Conditioned GANs,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2020, pp. 14286–14295

2020
[30]

Gaussian Mixture Generative Adversarial Networks for Diverse Datasets, and the Unsupervised Clustering of Images

Matan Ben-Yosef and Daphna Weinshall, “Gaussian mixture generative adversarial networks for diverse datasets, and the unsupervised clustering of images,”arXiv preprint arXiv:1808.10356, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[31]

Unsuper- vised image generation with infinite generative adversarial networks,

Hui Ying, He Wang, Tianjia Shao, Yin Yang, and Kun Zhou, “Unsuper- vised image generation with infinite generative adversarial networks,” inProc. IEEE Int. Conf. Comp. Vis., 2021, pp. 14284–14293

2021
[32]

A style-based generator architecture for generative adversarial networks,

Tero Karras, Samuli Laine, and Timo Aila, “A style-based generator architecture for generative adversarial networks,” inProc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019, pp. 4401–4410

2019
[33]

Gan dissection: Visualizing and understanding generative adversarial networks,

David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B Tenenbaum, William T Freeman, and Antonio Torralba, “Gan dissection: Visualizing and understanding generative adversarial networks,” inProc. Int. Conf. Learn. Representations, 2019

2019
[34]

Label-efficient semantic segmentation with diffusion models,

Dmitry Baranchuk, Andrey V oynov, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko, “Label-efficient semantic segmentation with diffusion models,” inProc. Int. Conf. Learn. Representations, 2022

2022
[35]

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets,

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel, “InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets,” inAdvances in Neural Inf. Process. Syst., 2016

2016
[36]

Challenging common assumptions in the unsupervised learning of disentangled representations,

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Syl- vain Gelly, Bernhard Sch ¨olkopf, and Olivier Bachem, “Challenging common assumptions in the unsupervised learning of disentangled representations,” inProc. Int. Conf. Mach. Learn.PMLR, 2019, pp. 4114–4124

2019
[37]

Posterior collapse and latent variable non-identifiability,

Yixin Wang, David Blei, and John P Cunningham, “Posterior collapse and latent variable non-identifiability,”Advances in Neural Inf. Process. Syst., vol. 34, pp. 5443–5455, 2021

2021
[38]

Self-guided diffusion models,

Vincent Tao Hu, David W Zhang, Yuki M Asano, Gertjan J Burghouts, and Cees GM Snoek, “Self-guided diffusion models,” inProc. IEEE Conf. Comp. Vis. Patt. Recogn., 2023, pp. 18413–18422

2023
[39]

Instance-conditioned gan,

Arantxa Casanova, Marlene Careil, Jakob Verbeek, Michal Drozdzal, and Adriana Romero Soriano, “Instance-conditioned gan,” inAdvances in Neural Inf. Process. Syst., 2021. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 17

2021
[40]

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie, “Representation alignment for generation: Training diffusion transformers is easier than you think,” arXiv preprint arXiv:2410.06940, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Closed-form factorization of latent semantics in gans,

Yujun Shen and Bolei Zhou, “Closed-form factorization of latent semantics in gans,” inProc. IEEE Conf. Comp. Vis. Patt. Recogn., 2021, pp. 1532–1540

2021
[42]

Neural discrete representa- tion learning,

Aaron Van Den Oord, Oriol Vinyals, et al., “Neural discrete representa- tion learning,” inAdvances in Neural Inf. Process. Syst., 2017, vol. 30

2017
[43]

Taming transformers for high-resolution image synthesis,

Patrick Esser, Robin Rombach, and Bjorn Ommer, “Taming transformers for high-resolution image synthesis,” inProc. IEEE Conf. Comp. Vis. Patt. Recogn., 2021, pp. 12873–12883

2021
[44]

Towards principled methods for training generative adversarial networks,

Martin Arjovsky and Leon Bottou, “Towards principled methods for training generative adversarial networks,” inProc. Int. Conf. Learn. Representations, 2017

2017
[45]

arXiv preprint arXiv:1606.05908 , year=

Carl Doersch, “Tutorial on variational autoencoders,”arXiv preprint arXiv:1606.05908, 2016

work page arXiv 2016
[46]

Gan inversion: A survey,

Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang, “Gan inversion: A survey,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 3, pp. 3121–3138, 2022

2022
[47]

Pivotal tuning for latent-based editing of real images,

Daniel Roich, Ron Mokady, Amit H Bermano, and Daniel Cohen- Or, “Pivotal tuning for latent-based editing of real images,”ACM Transactions on graphics (TOG), vol. 42, no. 1, pp. 1–13, 2022

2022
[48]

Null-text inversion for editing real images using guided diffusion models,

Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen- Or, “Null-text inversion for editing real images using guided diffusion models,” inProc. IEEE Conf. Comp. Vis. Patt. Recogn., 2023, pp. 6038– 6047

2023
[49]

Reciprocal gan through characteristic functions (rcf-gan),

Shengxi Li, Zeyang Yu, Min Xiang, and Danilo Mandic, “Reciprocal gan through characteristic functions (rcf-gan),”IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 2, pp. 2246–2263, 2022

2022
[50]

Contragan: Contrastive learning for conditional image generation,

Minguk Kang and Jaesik Park, “Contragan: Contrastive learning for conditional image generation,” inAdvances in Neural Inf. Process. Syst., 2020, vol. 33, pp. 21357–21369

2020
[51]

Training gans with stronger augmen- tations via contrastive discriminator,

Jongheon Jeong and Jinwoo Shin, “Training gans with stronger augmen- tations via contrastive discriminator,”arXiv preprint arXiv:2103.09742, 2021

work page arXiv 2021
[52]

3d object representations for fine-grained categorization,

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei, “3d object representations for fine-grained categorization,” inProceedings of the IEEE international conference on computer vision workshops, 2013, pp. 554–561

2013
[53]

Age progression/regression by conditional adversarial autoencoder,

Zhifei Zhang, Yang Song, and Hairong Qi, “Age progression/regression by conditional adversarial autoencoder,” inProc. IEEE Conf. Comp. Vis. Patt. Recogn.IEEE, 2017

2017
[54]

The caltech-ucsd birds-200-2011 dataset,

Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie, “The caltech-ucsd birds-200-2011 dataset,” 2011

2011
[55]

Automated flower classification over a large number of classes,

M-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” inProceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008

2008
[56]

Fine-grained image analysis with deep learning: A survey,

Xiu-Shen Wei, Yi-Zhe Song, Oisin Mac Aodha, Jianxin Wu, Yuxin Peng, Jinhui Tang, Jian Yang, and Serge Belongie, “Fine-grained image analysis with deep learning: A survey,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 12, pp. 8927–8948, 2021

2021
[57]

Bilinear cnn models for fine-grained visual recognition,

Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji, “Bilinear cnn models for fine-grained visual recognition,” inProc. IEEE Int. Conf. Comp. Vis., 2015, pp. 1449–1457

2015
[58]

Gans trained by a two time-scale update rule converge to a local nash equilibrium,

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” inAdvances in Neural Inf. Process. Syst., 2017, vol. 30

2017
[59]

Improved techniques for training gans,

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen, “Improved techniques for training gans,” in Advances in Neural Inf. Process. Syst., 2016, vol. 29

2016
[60]

39, Cambridge University Press Cambridge, 2008

Hinrich Sch ¨utze, Christopher D Manning, and Prabhakar Raghavan, Introduction to information retrieval, vol. 39, Cambridge University Press Cambridge, 2008

2008
[61]

Cluster ensembles—a knowledge reuse framework for combining multiple partitions,

Alexander Strehl and Joydeep Ghosh, “Cluster ensembles—a knowledge reuse framework for combining multiple partitions,”Journal of machine learning research, vol. 3, no. Dec, pp. 583–617, 2002

2002
[62]

Improved precision and recall metric for assessing generative models,

Tuomas Kynk ¨a¨anniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila, “Improved precision and recall metric for assessing generative models,” inAdvances in Neural Inf. Process. Syst., 2019, vol. 32

2019
[63]

Reliable fidelity and diversity metrics for generative models,

Muhammad Ferjad Naeem, Seong Joon Oh, Yunjey Young, Sungjin Baek, and Donghyeon Kim, “Reliable fidelity and diversity metrics for generative models,” inProc. Int. Conf. Mach. Learn.PMLR, 2020, pp. 7176–7185

2020
[64]

Masked autoencoders are scalable vision learners,

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick, “Masked autoencoders are scalable vision learners,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2022, pp. 16000–16009

2022
[65]

Masked siamese networks for label-efficient learning,

Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Mike Rabbat, and Nicolas Ballas, “Masked siamese networks for label-efficient learning,” in Proc. Eur. Conf. Comp. Vis. Springer, 2022, pp. 456–473.

Shengxi Li (Member, IEEE) received the Ph.D. degree in electrical and electronic engineering from Imp...

2022
