
arxiv: 1907.01427 · v1 · pith:S3HBNGOJnew · submitted 2019-07-02 · 💻 cs.CV · cs.LG

Improving Borderline Adulthood Facial Age Estimation through Ensemble Learning

Pith reviewed 2026-05-25 11:02 UTC · model grok-4.3

keywords facial age estimation · ensemble learning · deep learning · borderline adulthood · DEX model · underage estimation · age verification · DS13K model

The pith

An ensemble technique with a fine-tuned model reaches 68 percent accuracy for 16-17 year old faces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the persistent weakness of facial age estimation algorithms in the narrow band between non-adulthood and adulthood. The authors combine an ensemble method with a model called DS13K that is fine-tuned on the DEX base model. They report 68 percent accuracy on 16-17 year olds, presented as four times the accuracy of DEX alone in that range. A sympathetic reader would care because sharper discrimination in this borderline range supports practical decisions that separate minors from adults. The work also benchmarks several existing cloud and offline age-prediction services.

Core claim

The authors show that an ensemble technique applied to their DS13K deep learning model, after fine-tuning on the DEX model, produces 68 percent accuracy for the 16 to 17 years old age group. This is stated as four times the accuracy achieved by the DEX model for the same group. The approach is motivated by the consistent difficulty existing methods have with borderline adulthood cases, and the paper includes side-by-side evaluation of commercial services such as Amazon Rekognition, Microsoft Azure, How-Old.net, and DEX.

What carries the argument

The ensemble technique applied after fine-tuning the DS13K model on DEX, which aggregates predictions to improve handling of the 16-17 age band.
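
As a rough illustration of what "aggregates predictions" can mean, here is a minimal sketch. The abstract does not state the combination rule, so the unweighted mean, the function names `ensemble_age` and `age_bin`, and the sample numbers are all assumptions; only the age ranges are taken from the Figure 4 listing.

```python
from statistics import mean

# Age ranges as listed under Figure 4; reusing them here is an assumption.
AGE_BINS = [(0, 5), (6, 10), (11, 15), (16, 17), (18, 25)]


def ensemble_age(estimates):
    """Combine per-model age estimates with an unweighted mean (assumed rule)."""
    if not estimates:
        raise ValueError("need at least one estimate")
    return mean(estimates)


def age_bin(age):
    """Return the (low, high) range containing the rounded age, or None."""
    rounded = round(age)
    for low, high in AGE_BINS:
        if low <= rounded <= high:
            return (low, high)
    return None


# Three made-up base-model outputs for a single face.
combined = ensemble_age([15.2, 17.9, 17.3])
print(round(combined, 2), age_bin(combined))
```

A weighted mean, a vote over predicted ranges, or a learned meta-model would slot into `ensemble_age` the same way; which of these the paper uses is not visible from the abstract.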

If this is right

  • The ensemble raises accuracy for 16-17 year olds to 68 percent.
  • Accuracy in the target range is four times higher than that of the DEX model alone.
  • Existing commercial services exhibit the same weakness in the borderline age range as the base DEX model.
  • The method focuses specifically on underage estimation within the broader facial age estimation task.
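
The two headline numbers imply a baseline figure that the abstract does not quote directly. The arithmetic below assumes "four times" is a plain ratio of accuracies on the same age group; the variable names are illustrative.

```python
ensemble_accuracy = 0.68  # reported for the 16-17 group
improvement_factor = 4    # "four times", read as a ratio (assumption)

# Baseline accuracy implied by the two reported numbers, not a quoted result.
implied_dex_accuracy = ensemble_accuracy / improvement_factor
print(f"implied DEX accuracy: {implied_dex_accuracy:.2f}")  # prints 0.17
```

Under that reading, DEX would be correct on roughly one in six faces in the 16-17 range.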

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ensemble construction might be tested on other narrow age intervals where single models also fail.
  • Integration into age-verification pipelines could lower misclassification rates for young adults near the legal threshold.
  • Re-running the experiments with publicly fixed train-test splits would clarify how much of the gain is due to the modeling choices versus data handling.

Load-bearing premise

The accuracy gain for the 16-17 group comes from the ensemble method and fine-tuning rather than from dataset selection, split choices, or unreported post-processing steps.

What would settle it

Evaluating the original DEX model on the identical test images and age labels used for the ensemble and obtaining accuracy close to 68 percent for the 16-17 group would falsify the claimed improvement.
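
The check described above amounts to scoring both models on exactly the same labeled images. A minimal sketch follows, assuming predictions are stored per image ID and that a prediction counts as correct when its rounded value falls inside the 16-17 range; that scoring rule, the function name `paired_accuracy`, and the data are all hypothetical.

```python
def paired_accuracy(labels, preds_a, preds_b, low=16, high=17):
    """Score two models on the identical images whose true age is in low..high."""
    if set(preds_a) != set(labels) or set(preds_b) != set(labels):
        raise ValueError("both models must be scored on the identical image set")
    ids = [image_id for image_id, age in labels.items() if low <= age <= high]
    if not ids:
        raise ValueError("no subjects in the requested age range")

    def score(preds):
        # A hit is a rounded prediction that lands inside the target range.
        return sum(low <= round(preds[i]) <= high for i in ids) / len(ids)

    return score(preds_a), score(preds_b)


# Made-up labels and predictions keyed by image ID.
labels = {"a": 16, "b": 17, "c": 16, "d": 17, "e": 20}
baseline = {"a": 19.4, "b": 21.0, "c": 16.2, "d": 22.3, "e": 21.0}
ensemble = {"a": 16.4, "b": 17.2, "c": 15.1, "d": 16.8, "e": 19.0}

baseline_acc, ensemble_acc = paired_accuracy(labels, baseline, ensemble)
print(baseline_acc, ensemble_acc)
```

Rejecting mismatched ID sets up front is the point of the sketch: it rules out a comparison in which the two models saw different test cohorts.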

Figures

Figures reproduced from arXiv: 1907.01427 by Aikaterini Kanta, Brett A. Becker, David Lillis, Elias Bou-Harb, Felix Anda, Mark Scanlon, Nhien-An Le-Khac.

Figure 1: Average Estimated Age from each Service Compared with Actual Age.
Figure 2: Mean Absolute Error per Age by Service.

  Service                 MAE
  Amazon Rekognition      3.349
  How-Old.net             5.281
  Microsoft Azure         5.347
  (D)eep (EX)pectation    6.936

Figure 3: Average Faces of DS13K Subjects between 16 to 17
Figure 4: Performance vs Age Group.

  Range    Logistic Regression    Gradient Boosting    Bagging Regressor
  0-5      0.734                  0.703                0.707
  6-10     0.575                  0.665                0.553
  11-15    0.432                  0.391                0.441
  16-17    0.006                  0.609                0.428
  18-25    0.867                  0.684                0.713
  AVG      0.523                  0.611                0.569
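
For readers unfamiliar with the metric behind the MAE figures, the sketch below computes mean absolute error overall and per actual age. The function names and sample values are illustrative and are not taken from the paper's data.

```python
from collections import defaultdict


def mean_absolute_error(predicted, actual):
    """Average of |predicted - actual| over all subjects."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def mae_per_age(predicted, actual):
    """Group absolute errors by actual age and average each group."""
    errors = defaultdict(list)
    for p, a in zip(predicted, actual):
        errors[a].append(abs(p - a))
    return {age: sum(errs) / len(errs) for age, errs in sorted(errors.items())}


# Made-up ages and estimates.
actual = [16, 16, 17, 25]
predicted = [18.0, 20.0, 17.5, 24.0]

print(mean_absolute_error(predicted, actual))
print(mae_per_age(predicted, actual))
```

A single overall MAE can hide a weak age band, which is why the per-age breakdown matters for the 16-17 question.
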
Original abstract

Achieving high performance for facial age estimation with subjects in the borderline between adulthood and non-adulthood has always been a challenge. Several studies have used different approaches from the age of a baby to an elder adult and different datasets have been employed to measure the mean absolute error (MAE) ranging between 1.47 to 8 years. The weakness of the algorithms specifically in the borderline has been a motivation for this paper. In our approach, we have developed an ensemble technique that improves the accuracy of underage estimation in conjunction with our deep learning model (DS13K) that has been fine-tuned on the Deep Expectation (DEX) model. We have achieved an accuracy of 68% for the age group 16 to 17 years old, which is 4 times better than the DEX accuracy for such age range. We also present an evaluation of existing cloud-based and offline facial age prediction services, such as Amazon Rekognition, Microsoft Azure Cognitive Services, How-Old.net and DEX.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes an ensemble technique that integrates a fine-tuned DS13K deep learning model with the DEX baseline to improve facial age estimation accuracy specifically for the borderline 16-17 age group. It reports achieving 68% accuracy in this range (claimed to be 4x better than DEX) and provides an evaluation of several commercial cloud-based and offline age prediction services.

Significance. If the numerical improvement can be shown to result from the ensemble and fine-tuning rather than from dataset or split choices, the work would address a documented weakness in age estimation for near-adult subjects and could support more reliable systems for age-restricted content and verification tasks.

major comments (2)
  1. [Abstract] Abstract: The headline result of 68% accuracy for the 16-17 group (4x DEX) supplies no information on the size or composition of the test cohort for this bin, the exact definition of accuracy (exact-year match, ±1 year tolerance, etc.), the training/validation protocol, or whether the identical images and splits were used for the DEX baseline comparison. These omissions are load-bearing for the central claim that the gain is produced by the DS13K ensemble + fine-tuning.
  2. [Method / Experiments] Method and Experiments sections: No description is given of the DS13K architecture, the precise fine-tuning procedure and hyperparameters, the ensemble combination rule, or the source and labeling process for the 16-17 training and test images. Without these details the reported accuracy cannot be reproduced or attributed to the proposed technique.
minor comments (1)
  1. [Abstract] Abstract: The cited MAE range of 1.47–8 years from prior work is stated without references to the specific studies.
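
To show why the definition of accuracy raised in the first major comment matters, the sketch below scores the same predictions under exact-year match and under a ±1 year tolerance. The function name `tolerance_accuracy` and the data are hypothetical, and rounding predictions to the nearest year is an assumption.

```python
def tolerance_accuracy(predicted, actual, tolerance=0):
    """Fraction of rounded predictions within `tolerance` years of the label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        1 for p, a in zip(predicted, actual) if abs(round(p) - a) <= tolerance
    )
    return hits / len(actual)


# Made-up labels and predictions for subjects aged 16-17.
actual = [16, 16, 17, 17]
predicted = [16.3, 17.4, 17.1, 18.2]

exact = tolerance_accuracy(predicted, actual, tolerance=0)
within_one = tolerance_accuracy(predicted, actual, tolerance=1)
print(exact, within_one)
```

On this toy data the score doubles when the tolerance is loosened by one year, so a headline accuracy is hard to interpret without the definition attached.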

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify omissions that weaken the presentation of our central result. We will revise the manuscript to supply the missing information on experimental protocol, data, and implementation details so that the accuracy gain can be properly attributed and reproduced.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The headline result of 68% accuracy for the 16-17 group (4x DEX) supplies no information on the size or composition of the test cohort for this bin, the exact definition of accuracy (exact-year match, ±1 year tolerance, etc.), the training/validation protocol, or whether the identical images and splits were used for the DEX baseline comparison. These omissions are load-bearing for the central claim that the gain is produced by the DS13K ensemble + fine-tuning.

    Authors: We agree that the abstract as written does not contain these load-bearing details. In the revised version we will expand the abstract to report the number and source distribution of test images in the 16-17 bin, state that accuracy is defined as exact-year match, describe the cross-validation protocol, and explicitly confirm that the DEX baseline was run on the identical images and splits. These additions will allow readers to evaluate whether the reported improvement is attributable to the ensemble and fine-tuning.

    revision: yes

  2. Referee: [Method / Experiments] Method and Experiments sections: No description is given of the DS13K architecture, the precise fine-tuning procedure and hyperparameters, the ensemble combination rule, or the source and labeling process for the 16-17 training and test images. Without these details the reported accuracy cannot be reproduced or attributed to the proposed technique.

    Authors: We acknowledge that the Method and Experiments sections omit these implementation specifics. The original submission emphasized the ensemble concept and the headline accuracy figure but did not include the required technical description. In revision we will insert a new subsection that specifies the DS13K architecture, the exact fine-tuning schedule and hyperparameters, the ensemble aggregation rule, and the provenance and labeling procedure for the 16-17 images. This will make the experimental protocol reproducible and permit attribution of the accuracy gain to the proposed method.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical accuracy claims are self-contained experimental outcomes

full rationale

The paper presents an empirical ML study reporting measured accuracy on age estimation tasks after training an ensemble model fine-tuned from DEX. No derivation chain, equations, or first-principles results are claimed; the 68% figure is an observed performance metric on a test cohort, not a quantity defined in terms of itself or obtained by fitting a parameter that is then renamed as a prediction. No self-citation is invoked as a uniqueness theorem or load-bearing premise, and no ansatz is smuggled. The result is therefore independent of the inputs in the sense required by the circularity criteria and can be externally falsified on the same datasets.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical performance of a fine-tuned deep network combined via ensemble; this depends on standard but unspecified training choices and data handling that are not visible in the abstract.

free parameters (2)
  • Fine-tuning hyperparameters
    Learning rate, epochs, and other settings used to create DS13K from DEX are fitted to data and affect the final accuracy.
  • Ensemble combination rule
    The method for merging model outputs is chosen or tuned to produce the reported 68% figure.
axioms (1)
  • domain assumption: The DEX model provides a viable starting point whose weaknesses can be mitigated by fine-tuning and ensembling
    The paper builds directly on DEX performance without independent verification of its baseline suitability.


