arxiv: 1907.01144 · v1 · pith:YJATWLINnew · submitted 2019-07-02 · 💻 cs.CV

Disentangled Makeup Transfer with Generative Adversarial Network

Pith reviewed 2026-05-25 11:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords makeup transfer · generative adversarial network · disentangled representation · face synthesis · style transfer · identity preservation · GAN

The pith

A GAN disentangles identity from makeup style to support strength-controlled transfer and style sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DMT, a generative adversarial network that uses an identity encoder and a makeup encoder to separate personal identity from makeup style in arbitrary face images. A decoder reconstructs faces from these separate encodings while a discriminator enforces realism. This setup permits transferring makeup from one or more reference images to a source face at adjustable strength levels, and also allows drawing multiple varied outputs by sampling makeup styles from a prior distribution. Prior methods produced only single fixed outputs without independent control. A reader would care because the separation promises more flexible digital face editing than rigid transfer approaches.

Core claim

The model employs an identity encoder and a makeup encoder to disentangle personal identity and makeup style for arbitrary face images. Based on the outputs of the two encoders, a decoder reconstructs the original faces, and a discriminator distinguishes real faces from generated ones. As a result, the model can transfer makeup styles from one or more reference face images to a non-makeup face with controllable strength and produce various outputs with styles sampled from a prior distribution.
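
The data flow described above can be sketched with toy stand-ins. This is a minimal illustration, not the paper's networks: the names `identity_encoder`, `makeup_encoder`, `decoder`, `reconstruct`, and `transfer` are hypothetical, a face is assumed to be a flat vector whose first entries carry identity and whose remaining entries carry makeup, and the discriminator is omitted because it only matters during training.

```python
import numpy as np

ID_DIM, MAKEUP_DIM = 5, 3  # toy sizes, chosen only for illustration


def identity_encoder(face):
    """Toy stand-in for the identity encoder: keep the identity entries."""
    return face[:ID_DIM]


def makeup_encoder(face):
    """Toy stand-in for the makeup encoder: keep the makeup entries."""
    return face[ID_DIM:]


def decoder(identity_code, makeup_code):
    """Toy stand-in for the decoder: rebuild a face from the two codes."""
    return np.concatenate([identity_code, makeup_code])


def reconstruct(face):
    """Encode a face with both encoders and decode it again."""
    return decoder(identity_encoder(face), makeup_encoder(face))


def transfer(source, reference):
    """Combine the source's identity code with the reference's makeup code."""
    return decoder(identity_encoder(source), makeup_encoder(reference))


rng = np.random.default_rng(0)
source = rng.normal(size=ID_DIM + MAKEUP_DIM)
reference = rng.normal(size=ID_DIM + MAKEUP_DIM)
result = transfer(source, reference)
```

In this toy setting the separation is perfect by construction, so `result` keeps the source's identity entries and takes the reference's makeup entries. Whether learned encoders achieve the same separation is exactly what the paper has to demonstrate.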

What carries the argument

The pair of encoders: an identity encoder and a makeup encoder that separate personal identity from makeup style, so the decoder can be driven by each factor independently.

If this is right

  • Makeup can be transferred from single or multiple reference images to a non-makeup source face.
  • The transferred makeup strength can be adjusted continuously during generation.
  • Multiple distinct outputs can be produced by sampling makeup styles from a learned prior distribution.
  • Generated faces remain high-quality and realistic across these different transfer scenarios.
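
One way these three capabilities could be realised on makeup codes is sketched below. The abstract does not say how strength is parameterised, how several references are combined, or which prior is used, so linear interpolation, a weighted average, and a standard normal prior are all assumptions here, and the function names are hypothetical.

```python
import numpy as np


def blend_strength(source_code, reference_code, alpha):
    """Assumed strength control: alpha=0 keeps the source makeup code,
    alpha=1 uses the reference makeup code, values in between interpolate."""
    return (1.0 - alpha) * source_code + alpha * reference_code


def blend_references(reference_codes, weights):
    """Assumed multi-reference transfer: weighted average of makeup codes."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalise so the weights sum to 1
    return weights @ np.stack(reference_codes)


def sample_styles(num_samples, code_dim, seed=0):
    """Assumed prior: draw makeup codes from a standard normal distribution."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_samples, code_dim))


no_makeup = np.array([0.0, 2.0])
style_a = np.array([4.0, 6.0])
half_strength = blend_strength(no_makeup, style_a, 0.5)
hybrid = blend_references([no_makeup, style_a], [1, 3])
random_styles = sample_styles(num_samples=4, code_dim=8)
```

Each resulting code would then be passed to the decoder together with the source's identity code to produce an output face.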

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same encoder separation might be reused to control other facial attributes such as age or expression without retraining the full model.
  • Interactive editing tools could let users drag a strength slider and see immediate results on uploaded photos.
  • Sampling from the prior could generate large synthetic datasets of made-up faces for training downstream recognition systems.

Load-bearing premise

For any input face image, the two encoders can separate identity information from makeup information without mixing the two or losing either.

What would settle it

A test set where increasing the makeup strength parameter either alters the source person's identity or produces outputs that no longer match the reference makeup style would show the disentanglement has failed.
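
A test of this kind could be organised as a strength sweep, sketched below under stated assumptions. The helper `strength_sweep` and the toy model are hypothetical. In a real evaluation `identity_fn` would be something like a face-recognition embedding and `makeup_fn` a makeup-style descriptor, neither of which is specified on this page.

```python
import numpy as np


def strength_sweep(transfer_fn, identity_fn, makeup_fn, source, reference, strengths):
    """For each strength, record how far the output drifts from the source
    identity and how far it remains from the reference makeup."""
    rows = []
    for alpha in strengths:
        output = transfer_fn(source, reference, alpha)
        identity_drift = float(np.linalg.norm(identity_fn(output) - identity_fn(source)))
        makeup_gap = float(np.linalg.norm(makeup_fn(output) - makeup_fn(reference)))
        rows.append((float(alpha), identity_drift, makeup_gap))
    return rows


# Toy model, assumed only for illustration: the first ID_DIM entries of a
# face vector are identity, the remaining entries are makeup.
ID_DIM = 5


def toy_identity(face):
    return face[:ID_DIM]


def toy_makeup(face):
    return face[ID_DIM:]


def toy_transfer(source, reference, alpha):
    makeup = (1.0 - alpha) * toy_makeup(source) + alpha * toy_makeup(reference)
    return np.concatenate([toy_identity(source), makeup])


rng = np.random.default_rng(0)
source, reference = rng.normal(size=8), rng.normal(size=8)
rows = strength_sweep(toy_transfer, toy_identity, toy_makeup,
                      source, reference, np.linspace(0.0, 1.0, 5))
```

For a well-disentangled model the identity drift should stay near zero at every strength while the makeup gap shrinks as the strength grows. A rising identity drift, or a makeup gap that fails to shrink, is the failure pattern described above.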

Figures

Figures reproduced from arXiv: 1907.01144 by Hao He, Honglun Zhang, Wenqing Chen, Yaohui Jin.

Figure 1: Different scenarios of makeup transfer. Most related researches only focus on the pair-wise makeup transfer. In contrast, our model …
Figure 2: The disentangled architecture of DMT, which contains …
Figure 4: Calculation of the makeup loss. We first perform histogram matching on different cosmetic regions of …
Figure 5: Detailed structures of E_i, E_m, G and D, where blocks of different colors denote different types of neural layers.
Figure 6: Examples of makeup-related region M′ and the generated attention mask M.
    … to conduct pair-wise makeup transfer between x and y. Apart from generating the face image x̃_s, G also learns to produce an attention mask M ∈ [0, 1]^(H×W) to localize the makeup-related region, where higher values mean stronger relation. Based on the above definition of M, we obtain the refined result by selectively extracting the rela…
Figure 7: Ablation study by removing L^G_face, L^G_brow, L^G_eye, L^G_lip from DMT respectively.
    In Fig. 5, we use blocks of different colors to denote different types of neural layers and illustrate the network structures of E_i, E_m, G, D in details. We specify the settings of convolution layers with the attached texts. For example, k7n64s1 means a convolution layer with 64 filters of kernel size 7 × 7 and stride …
Figure 8: Ablation study of the attention mask M, the attention loss L^G_a and the perceptual loss L^G_per
Figure 9: Transfer results of DMT against the baselines. DMT can achieve high-quality results and well preserve makeup-unrelated content.
Figure 10: Transfer results and residual images of DMT against BG for more makeup styles.
Figure 11: Visualization of the learned makeup distribution after dimension reduction.
Figure 13: Hybrid makeup transfer of DMT by combining the …
Figure 14: Face interpolation of DMT by combining the identity …
Figure 17: Linear interpolation on different dimensions of …
Figure 16: Multi-modal makeup transfer of DMT by randomly sam…

Original abstract

Facial makeup transfer is a widely-used technology that aims to transfer the makeup style from a reference face image to a non-makeup face. Existing literature leverage the adversarial loss so that the generated faces are of high quality and realistic as real ones, but are only able to produce fixed outputs. Inspired by recent advances in disentangled representation, in this paper we propose DMT (Disentangled Makeup Transfer), a unified generative adversarial network to achieve different scenarios of makeup transfer. Our model contains an identity encoder as well as a makeup encoder to disentangle the personal identity and the makeup style for arbitrary face images. Based on the outputs of the two encoders, a decoder is employed to reconstruct the original faces. We also apply a discriminator to distinguish real faces from fake ones. As a result, our model can not only transfer the makeup styles from one or more reference face images to a non-makeup face with controllable strength, but also produce various outputs with styles sampled from a prior distribution. Extensive experiments demonstrate that our model is superior to existing literature by generating high-quality results for different scenarios of makeup transfer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes DMT, a GAN with an identity encoder, a makeup encoder, a decoder, and a discriminator. The encoders are intended to disentangle personal identity from makeup style on arbitrary faces; their outputs are combined by the decoder to reconstruct or transfer makeup. The central claims are that this enables (i) makeup transfer from one or more reference images with controllable strength and (ii) generation of diverse outputs by sampling makeup styles from a prior distribution, with the model asserted to be superior to prior work on the basis of extensive experiments.

Significance. If the claimed disentanglement holds and is supported by appropriate quantitative evidence, the architecture would offer a more flexible alternative to fixed-output makeup transfer methods, supporting both reference-driven transfer and unconditional sampling. The approach aligns with broader trends in disentangled representation learning for image manipulation.

major comments (2)
  1. [Abstract] Abstract: The claim that the identity encoder and makeup encoder 'disentangle the personal identity and the makeup style' is load-bearing for both controllable transfer and prior sampling, yet the abstract supplies no description of loss terms (e.g., explicit invariance penalties, mutual-information minimization, or cycle-consistency constraints) that would force the identity encoder to ignore makeup variations and the makeup encoder to ignore identity cues. Standard reconstruction plus adversarial losses alone do not guarantee this separation.
  2. [Abstract] Abstract: Superiority is asserted via 'extensive experiments' that 'demonstrate that our model is superior,' but no quantitative metrics (FID, PSNR, user-study percentages, or comparison tables), training details, or failure-case analysis are referenced. This absence prevents verification of whether the encoders actually achieve the required factor separation.
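
To make the first comment concrete, the sketch below shows one kind of term the referee has in mind: a latent consistency penalty that re-encodes a transferred face and compares the recovered codes with the codes that produced it. This is an illustration under assumptions, not a loss taken from the paper. The function names and the toy vector model are hypothetical.

```python
import numpy as np

ID_DIM = 5  # toy split: the first 5 entries are identity, the rest are makeup


def latent_consistency_penalty(identity_fn, makeup_fn, decode_fn, source, reference):
    """Transfer makeup, re-encode the output, and penalise any change in
    the identity code or the makeup code (mean absolute difference)."""
    identity_code = identity_fn(source)
    makeup_code = makeup_fn(reference)
    output = decode_fn(identity_code, makeup_code)
    identity_term = np.mean(np.abs(identity_fn(output) - identity_code))
    makeup_term = np.mean(np.abs(makeup_fn(output) - makeup_code))
    return float(identity_term + makeup_term)


def decode(identity_code, makeup_code):
    """Toy decoder: uses only the first ID_DIM identity entries."""
    return np.concatenate([identity_code[:ID_DIM], makeup_code])


def clean_identity(face):
    """Identity encoder that sees identity entries only."""
    return face[:ID_DIM]


def leaky_identity(face):
    """Identity encoder that also picks up one makeup entry."""
    return face[:ID_DIM + 1]


def makeup(face):
    return face[ID_DIM:]


rng = np.random.default_rng(0)
source, reference = rng.normal(size=8), rng.normal(size=8)
clean_penalty = latent_consistency_penalty(clean_identity, makeup, decode, source, reference)
leaky_penalty = latent_consistency_penalty(leaky_identity, makeup, decode, source, reference)
```

The penalty is zero for the clean encoder and positive for the leaky one, because the leaked makeup entry changes when the reference's makeup is swapped in. A term like this gives the optimiser a reason to keep makeup out of the identity code, which reconstruction and adversarial losses alone do not.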

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. Below we respond point by point to the major comments and indicate the revisions we will make.

  1. Referee: [Abstract] Abstract: The claim that the identity encoder and makeup encoder 'disentangle the personal identity and the makeup style' is load-bearing for both controllable transfer and prior sampling, yet the abstract supplies no description of loss terms (e.g., explicit invariance penalties, mutual-information minimization, or cycle-consistency constraints) that would force the identity encoder to ignore makeup variations and the makeup encoder to ignore identity cues. Standard reconstruction plus adversarial losses alone do not guarantee this separation.

    Authors: The abstract is a concise summary and therefore omits the specific loss formulations, which are presented in Section 3 of the manuscript. There the identity encoder is trained with a reconstruction objective on the source face while the makeup encoder is trained to extract style features that are combined by the decoder; the separate encoder pathways and the reconstruction objective are intended to encourage the desired factor separation. We acknowledge that the abstract does not make this explicit and will revise it to include a brief reference to the reconstruction and adversarial losses that support disentanglement.
    revision: yes

  2. Referee: [Abstract] Abstract: Superiority is asserted via 'extensive experiments' that 'demonstrate that our model is superior,' but no quantitative metrics (FID, PSNR, user-study percentages, or comparison tables), training details, or failure-case analysis are referenced. This absence prevents verification of whether the encoders actually achieve the required factor separation.

    Authors: The abstract summarizes the outcome of the experiments; the full manuscript reports quantitative comparisons using FID, user-study percentages, and side-by-side tables against prior methods, together with training details and selected failure cases. To address the referee's concern we will revise the abstract to mention that superiority is demonstrated via quantitative metrics and user studies.
    revision: yes

Circularity Check

0 steps flagged

No circularity: architecture description contains no derivations or fitted predictions

The paper presents a GAN-based model with identity and makeup encoders feeding a decoder, plus a discriminator. No equations, parameter-fitting steps, or predictions are described that reduce to inputs by construction. The disentanglement claim rests on the stated architecture and (unstated) training losses rather than any self-referential reduction or self-citation chain. This matches the default expectation of a non-circular empirical ML paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only review limits visibility into specific parameters or assumptions; the model implicitly assumes standard GAN training dynamics and successful feature separation without providing evidence of either.

free parameters (1)
  • makeup strength control
    Mentioned as controllable but no specific parameterization or fitting procedure described.
axioms (1)
  • domain assumption: Separate encoders can disentangle identity from makeup style in face images
    Central design choice invoked to enable independent control and sampling.

pith-pipeline@v0.9.0 · 5725 in / 1112 out tokens · 19056 ms · 2026-05-25T11:32:50.986511+00:00 · methodology
