WIDER-FAIR: An Annotated Version of the WIDER-FACE Dataset for Fairness Evaluation

Beno\^it Ronval; F\'elicien Schiltz; Maxime Moussi; Siegfried Nijssen

arxiv: 2606.31704 · v1 · pith:5SPYBK4Dnew · submitted 2026-06-30 · 💻 cs.CV · cs.LG

WIDER-FAIR: An Annotated Version of the WIDER-FACE Dataset for Fairness Evaluation

Maxime Moussi , Beno\^it Ronval , Siegfried Nijssen , F\'elicien Schiltz This is my paper

Pith reviewed 2026-07-01 05:35 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords face detectionfairnessbias evaluationethnicity annotationWIDER-FACEdemographic disparitiesYOLOv5

0 comments

The pith

WIDER-FAIR annotations show face detection models perform worse on Black faces and that excluding them from training widens fairness gaps more than any other group.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adds manual labels for perceived ethnicity and sex to every face in the WIDER-FACE benchmark, creating the WIDER-FAIR dataset with four ethnic categories and two sex categories. Validation with face embeddings, KNN classification, and t-SNE plots confirms the labels are coherent. Training a YOLOv5 detector on the annotated data and running ablations reveals lower detection rates for Black faces. Removing Black faces from the training set produces a larger increase in group-wise performance gaps than removing any other ethnic group.

Core claim

By manually annotating WIDER-FACE with perceived ethnicity (Asian, Black, Indian, White) and sex, the authors enable ablation experiments that establish lower detection performance for Black faces and show that excluding the Black group from training increases fairness disparity more than excluding any other ethnic group.

What carries the argument

The WIDER-FAIR dataset of 16,256 images with manual annotations of perceived ethnicity and sex, used to support training ablations and fairness measurements on a YOLOv5 detector.

If this is right

Face detection accuracy varies across the four annotated ethnic groups.
Black faces exhibit the lowest detection performance among the groups tested.
Excluding Black faces from training data produces a larger fairness disparity than excluding Asian, Indian, or White faces.
Embedding-based checks and visualization support the internal consistency of the added annotations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same annotation process could be applied to other detection benchmarks to check whether the Black-face performance pattern appears across different model families.
The results point to the value of ensuring Black faces are well represented in training sets to limit disparity growth.
Future experiments might test whether the observed gaps persist when the detector is evaluated on images captured under different lighting or pose conditions.

Load-bearing premise

The manual annotations of perceived ethnicity and sex are sufficiently accurate and consistent to support reliable conclusions about model performance differences across groups.

What would settle it

Re-annotating a substantial subset of the faces with a new set of human labelers and retraining the detector to find that the Black-face performance gap disappears or that another group produces a larger disparity increase when excluded.

Figures

Figures reproduced from arXiv: 2606.31704 by Beno\^it Ronval, F\'elicien Schiltz, Maxime Moussi, Siegfried Nijssen.

**Figure 1.** Figure 1: Samples from the WIDER-FAIR dataset The second step is the annotation for the sensitive features. We consider two features: the ethnicity and the sex of the people in the images. For the ethnicity, we distinguish between four values: Asian, Black, Indian, and White. Although more or different values could be considered, we find these four categories to cover a large set of the images from WIDER-FACE, while… view at source ↗

**Figure 3.** Figure 3: Confusion matrix for sex classification two additional values for the sensitive groups: Undetermined, whenever the face was too difficult to classify due to the image quality or simply because the sensitive value was unclear, and Other, when the ethnicity of the presented face did not match any of the four values listed above, for example for Native American. These two values are introduced to allow the an… view at source ↗

**Figure 4.** Figure 4: t-SNE representation of the annotated images [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Performance of the base model with the ethnicity [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 7.** Figure 7: Performance in LOEO Asian 0.5 0.6 0.7 0.8 0.9 IoU Threshold 0.0 0.2 0.4 0.6 0.8 Recall Asian Black Indian White [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

**Figure 10.** Figure 10: Performance in LOSO female 0.5 0.6 0.7 0.8 0.9 IoU Threshold 0.0 0.2 0.4 0.6 0.8 Recall Female Male [PITH_FULL_IMAGE:figures/full_fig_p012_10.png] view at source ↗

read the original abstract

The deployment of face detection models in real-world applications raises important fairness concerns, as these systems may showcase performance disparities across demographic groups. A key obstacle to studying and mitigating such biases is the lack of face detection datasets with sensitive feature annotations. To address this gap, we introduce WIDER-FAIR, a new dataset built on the widely used WIDER-FACE benchmark, manually annotated with the perceived ethnicity and sex of each face. The dataset contains 16,256 images annotated across four ethnic groups: Asian, Black, Indian, and White, and two sex categories. We assess the quality and coherence of the annotations using face embeddings, a K-Nearest Neighbors classifier, and a t-SNE visualization, all of which support the consistency of the labeling process. As a demonstration of the dataset's potential, we train a YOLOv5 model and perform ablation studies on each sensitive feature. Among other findings, our experiments show that detection performance is notably lower for faces of Black individuals, and that excluding this group from training increases fairness disparity more than excluding any other ethnic group. These observations illustrate the value of demographically annotated datasets for understanding and evaluating bias in face detection models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

WIDER-FAIR adds ethnicity and sex labels to WIDER-FACE with some ablation results on detection gaps, but the claims hinge on annotation accuracy that the checks do not fully establish.

read the letter

The main point is that the authors manually labeled 16k WIDER-FACE images for four ethnic groups and two sexes, then ran YOLOv5 ablations showing lower performance on Black faces and larger fairness gaps when that group is left out of training.

The useful part is the dataset itself. WIDER-FACE is already a standard benchmark, so adding these labels gives people a ready resource for controlled fairness experiments without building everything from scratch. The embedding, KNN, and t-SNE checks are a reasonable way to show the labels are at least coherent with the visual features.

The soft spot is the label quality. Perceived ethnicity is subjective, and the checks only confirm internal consistency, not correctness against any external reference. There is no reported inter-annotator agreement, no protocol details on handling ambiguous cases, and no error rate estimates. If a meaningful fraction of the Black-labeled faces are misassigned due to lighting, pose, or annotator bias, the ablation finding that excluding Black faces hurts disparity most loses its grounding.

This is for people working on demographic bias in face detection who need a benchmark to test mitigation methods. A reader who wants a concrete starting point for experiments will find it practical.

It should go to peer review rather than desk reject. The dataset contribution is straightforward and the experiments are replicable, even if referees will need to press on the annotation validation.

Referee Report

1 major / 2 minor

Summary. The paper introduces WIDER-FAIR, a manually annotated extension of the WIDER-FACE dataset providing perceived ethnicity (Asian, Black, Indian, White) and sex labels for faces in 16,256 images. Annotations are assessed for coherence via face embeddings, KNN classification, and t-SNE visualization. As a use case, the authors train a YOLOv5 detector and run ablation studies on demographic subsets, reporting notably lower detection performance on Black faces and that excluding the Black group from training increases fairness disparity more than exclusion of any other ethnic group.

Significance. If the ethnicity annotations prove sufficiently accurate, the dataset would be a useful resource for fairness research in face detection, directly addressing the lack of demographically labeled benchmarks. The embedding-based consistency checks are a methodological strength, and the ablation findings on group-specific impacts could inform targeted bias mitigation if the labels hold. The work is otherwise standard in its use of existing detectors and metrics.

major comments (1)

[§3] §3 (Annotation process and validation): The manuscript reports that embeddings + KNN + t-SNE support label consistency, yet provides no inter-annotator agreement statistics, annotation guidelines for ambiguous cases, number of annotators, or error rates against any external reference. Because the central claims (lower AP on Black faces; largest disparity increase when Black group is ablated) rest on the correctness of the four ethnicity labels rather than mere internal coherence, this omission is load-bearing; systematic mislabeling correlated with pose, lighting, or annotator bias would directly confound the §4 ablation results.

minor comments (2)

[Abstract] The abstract states the dataset contains 16,256 images but does not clarify whether this is the full WIDER-FACE subset or a filtered portion; adding this detail would improve reproducibility.
[Figure 3] Figure captions for the t-SNE plots should explicitly state the embedding model and distance metric used, as these choices affect interpretation of cluster separation.

Simulated Author's Rebuttal

1 responses · 1 unresolved

We thank the referee for the constructive feedback on the annotation validation. We address the concern regarding §3 below, providing clarification on our approach while acknowledging areas where additional details can be supplied.

read point-by-point responses

Referee: [§3] §3 (Annotation process and validation): The manuscript reports that embeddings + KNN + t-SNE support label consistency, yet provides no inter-annotator agreement statistics, annotation guidelines for ambiguous cases, number of annotators, or error rates against any external reference. Because the central claims (lower AP on Black faces; largest disparity increase when Black group is ablated) rest on the correctness of the four ethnicity labels rather than mere internal coherence, this omission is load-bearing; systematic mislabeling correlated with pose, lighting, or annotator bias would directly confound the §4 ablation results.

Authors: We agree that the manuscript would benefit from more explicit details on the annotation process. The ethnicity and sex labels are perceived attributes assigned based on visual inspection of the images, following standard practices in fairness datasets where no objective ground truth exists. The embedding + KNN + t-SNE analysis was selected specifically to demonstrate internal label coherence via an objective, data-driven method that does not rely on external references (which are unavailable for perceived ethnicity on this scale). We did not compute inter-annotator agreement because the process involved a single primary annotator with review for ambiguous cases, and guidelines were informal visual criteria rather than a formalized document. We will revise §3 to report the number of annotators involved, summarize the annotation criteria used, and expand the discussion of why embedding-based validation was prioritized over IAA for this task. However, we cannot provide error rates against an external reference, as none was collected or available. revision: partial

standing simulated objections not resolved

Error rates against any external reference for the ethnicity labels, as no such reference dataset or ground truth was used or available during annotation.

Circularity Check

0 steps flagged

No circularity; purely empirical annotation and ablation study with no derivations or self-referential fits.

full rationale

The paper introduces manual annotations for ethnicity and sex on WIDER-FACE, validates label coherence via embeddings/KNN/t-SNE (internal consistency check only), trains YOLOv5, and runs group-exclusion ablations. No equations, no fitted parameters renamed as predictions, no self-citation chains, and no ansatz or uniqueness claims appear in the provided text. The central claims rest on new data collection and standard supervised training, making the work self-contained against external benchmarks without any reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset paper with no theoretical derivations, fitted parameters, or new postulated entities; the annotations rely on standard manual labeling practices.

pith-pipeline@v0.9.1-grok · 5758 in / 1036 out tokens · 27053 ms · 2026-07-01T05:35:08.237745+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency. PMLR, 77–91

2018
[2]

Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. 2020. Retinaface: Single-shot multi-level face localisation in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 5203–5212

2020
[3]

Tausif Diwan, G Anirudh, and Jitendra V Tembhurne. 2023. Object detection using YOLO: challenges, architectural successors, datasets and applications.multimedia Tools and Applications82, 6 (2023), 9243–9275

2023
[4]

European Parliament and Council of the European Union. 2024. Regulation (EU) 2024/1689 of 13 June 2024 Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act). https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng

2024
[5]

Kimmo Karkkainen and Jungseock Joo. 2021. Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. InProceedings of the IEEE/CVF winter conference on applications of computer vision. 1548–1558

2021
[6]

Davis E King. 2009. Dlib-ml: A machine learning toolkit.The Journal of Machine Learning Research10 (2009), 1755–1758

2009
[7]

Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2021. A survey on bias and fairness in machine learning.ACM computing surveys (CSUR)54, 6 (2021), 1–35

2021
[8]

Hanna F Menezes, Arthur SC Ferreira, Eanes T Pereira, and Herman M Gomes. 2021. Bias and fairness in face detection. In2021 34th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI). IEEE, 247–254. Proceedings of ECAF’26. September 02 – September 04, 2026. Ghent, BE. 14 • M. Moussi et al

2021
[9]

Surbhi Mittal, Kartik Thakral, Puspita Majumdar, Mayank Vatsa, and Richa Singh. 2023. Are face detection models biased?. In2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG). IEEE, 1–7

2023
[10]

Emmanouil Panagiotou, Benoît Ronval, Arjun Roy, Ludwig Bothmann, Bernd Bischl, Siegfried Nijssen, and Eirini Ntoutsi. 2025. TABFAIRGDT: A Fast Fair Tabular Data Generator using Autoregressive Decision Trees.arXiv preprint arXiv:2509.19927(2025)

work page arXiv 2025
[11]

Delong Qi, Weijun Tan, Qi Yao, and Jingfeng Liu. 2022. YOLO5Face: Why reinventing a face detector. InEuropean Conference on Computer Vision. Springer, 228–244

2022
[12]

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 779–788

2016
[13]

Benoît Ronval, Siegfried Nijssen, and Ludwig Bothmann. 2024. Can generative AI-based data balancing mitigate unfairness issues in machine learning?. InEWAF

2024
[14]

Juan Terven, Diana-Margarita Córdova-Esparza, and Julio-Alejandro Romero-González. 2023. A comprehensive review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas.Machine learning and knowledge extraction5, 4 (2023), 1680–1716

2023
[15]

Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. Yfcc100m: The new data in multimedia research.Commun. ACM59, 2 (2016), 64–73

2016
[16]

Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE.Journal of machine learning research9, 11 (2008)

2008
[17]

Benjamin Wilson, Judy Hoffman, and Jamie Morgenstern. 2019. Predictive inequity in object detection.arXiv preprint arXiv:1902.11097 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019
[18]

Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. 2016. Wider face: A face detection benchmark. InProceedings of the IEEE conference on computer vision and pattern recognition. 5525–5533

2016
[19]

Yu Yang, Aayush Gupta, Jianwei Feng, Prateek Singhal, Vivek Yadav, Yue Wu, Pradeep Natarajan, Varsha Hedau, and Jungseock Joo
[20]

InAIES 2022

Enhancing fairness in face detection in computer vision systems by demographic bias mitigation. InAIES 2022. https://www. amazon.science/publications/enhancing-fairness-in-face-detection-in-computer-vision-systems-by-demographic-bias-mitigation

2022
[21]

Stefanos Zafeiriou, Cha Zhang, and Zhengyou Zhang. 2015. A survey on face detection in the wild: past, present and future.Computer Vision and Image Understanding138 (2015), 1–24

2015
[22]

Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. 2016. Joint face detection and alignment using multitask cascaded convolutional networks.IEEE signal processing letters23, 10 (2016), 1499–1503. Proceedings of ECAF’26. September 02 – September 04, 2026. Ghent, BE

2016

[1] [1]

Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency. PMLR, 77–91

2018

[2] [2]

Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. 2020. Retinaface: Single-shot multi-level face localisation in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 5203–5212

2020

[3] [3]

Tausif Diwan, G Anirudh, and Jitendra V Tembhurne. 2023. Object detection using YOLO: challenges, architectural successors, datasets and applications.multimedia Tools and Applications82, 6 (2023), 9243–9275

2023

[4] [4]

European Parliament and Council of the European Union. 2024. Regulation (EU) 2024/1689 of 13 June 2024 Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act). https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng

2024

[5] [5]

Kimmo Karkkainen and Jungseock Joo. 2021. Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. InProceedings of the IEEE/CVF winter conference on applications of computer vision. 1548–1558

2021

[6] [6]

Davis E King. 2009. Dlib-ml: A machine learning toolkit.The Journal of Machine Learning Research10 (2009), 1755–1758

2009

[7] [7]

Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2021. A survey on bias and fairness in machine learning.ACM computing surveys (CSUR)54, 6 (2021), 1–35

2021

[8] [8]

Hanna F Menezes, Arthur SC Ferreira, Eanes T Pereira, and Herman M Gomes. 2021. Bias and fairness in face detection. In2021 34th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI). IEEE, 247–254. Proceedings of ECAF’26. September 02 – September 04, 2026. Ghent, BE. 14 • M. Moussi et al

2021

[9] [9]

Surbhi Mittal, Kartik Thakral, Puspita Majumdar, Mayank Vatsa, and Richa Singh. 2023. Are face detection models biased?. In2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG). IEEE, 1–7

2023

[10] [10]

Emmanouil Panagiotou, Benoît Ronval, Arjun Roy, Ludwig Bothmann, Bernd Bischl, Siegfried Nijssen, and Eirini Ntoutsi. 2025. TABFAIRGDT: A Fast Fair Tabular Data Generator using Autoregressive Decision Trees.arXiv preprint arXiv:2509.19927(2025)

work page arXiv 2025

[11] [11]

Delong Qi, Weijun Tan, Qi Yao, and Jingfeng Liu. 2022. YOLO5Face: Why reinventing a face detector. InEuropean Conference on Computer Vision. Springer, 228–244

2022

[12] [12]

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 779–788

2016

[13] [13]

Benoît Ronval, Siegfried Nijssen, and Ludwig Bothmann. 2024. Can generative AI-based data balancing mitigate unfairness issues in machine learning?. InEWAF

2024

[14] [14]

Juan Terven, Diana-Margarita Córdova-Esparza, and Julio-Alejandro Romero-González. 2023. A comprehensive review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas.Machine learning and knowledge extraction5, 4 (2023), 1680–1716

2023

[15] [15]

Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. Yfcc100m: The new data in multimedia research.Commun. ACM59, 2 (2016), 64–73

2016

[16] [16]

Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE.Journal of machine learning research9, 11 (2008)

2008

[17] [17]

Benjamin Wilson, Judy Hoffman, and Jamie Morgenstern. 2019. Predictive inequity in object detection.arXiv preprint arXiv:1902.11097 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019

[18] [18]

Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. 2016. Wider face: A face detection benchmark. InProceedings of the IEEE conference on computer vision and pattern recognition. 5525–5533

2016

[19] [19]

Yu Yang, Aayush Gupta, Jianwei Feng, Prateek Singhal, Vivek Yadav, Yue Wu, Pradeep Natarajan, Varsha Hedau, and Jungseock Joo

[20] [20]

InAIES 2022

Enhancing fairness in face detection in computer vision systems by demographic bias mitigation. InAIES 2022. https://www. amazon.science/publications/enhancing-fairness-in-face-detection-in-computer-vision-systems-by-demographic-bias-mitigation

2022

[21] [21]

Stefanos Zafeiriou, Cha Zhang, and Zhengyou Zhang. 2015. A survey on face detection in the wild: past, present and future.Computer Vision and Image Understanding138 (2015), 1–24

2015

[22] [22]

Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. 2016. Joint face detection and alignment using multitask cascaded convolutional networks.IEEE signal processing letters23, 10 (2016), 1499–1503. Proceedings of ECAF’26. September 02 – September 04, 2026. Ghent, BE

2016