arxiv: 2605.12619 · v1 · submitted 2026-05-12 · 🧬 q-bio.NC · cs.CV

Recognition: unknown

Human face perception reflects inverse-generative and naturalistic discriminative objectives

Wenxuan Guo , Heiko H. Sch\"utt , Kamila Maria Jozwik , Katherine R. Storrs , Nikolaus Kriegeskorte , Tal Golan

Authors on Pith no claims yet

Pith reviewed 2026-05-14 20:25 UTC · model grok-4.3

classification 🧬 q-bio.NC cs.CV

keywords face perceptiondeep neural networksinverse renderingface identificationnatural imageshuman judgmentscontroversial stimuligenerative models

0 comments

The pith

Human face perception aligns best with neural networks trained via inverse rendering or natural-image classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests competing computational accounts of face perception by training six networks on the same architecture but different objectives and then pitting their predictions against human dissimilarity judgments. It introduces controversial face pairs—images optimized so that the models disagree sharply—alongside ordinary random pairs to expose differences that standard stimuli hide. Models that reconstruct latent causes of facial appearance or learn to identify faces and objects in natural photographs match human responses most closely, while synthetic-image training performs worse. A sympathetic reader would care because the result points to the specific learning goals that shape human face representations rather than leaving the question open among many plausible neural hypotheses.

Core claim

By comparing six neural network models sharing an architecture but trained on distinct tasks, using both randomly sampled face pairs and controversial pairs optimized to elicit opposing predictions, the authors find that models prioritizing high-level invariant structures—trained via inverse rendering, face identification, or object classification—most robustly match human face-dissimilarity judgments. Models trained on natural images typically outperform their synthetic-trained counterparts. These patterns indicate that human face perception is shaped by mechanisms that infer latent causes of facial appearance, discount nuisance variation, and are tuned by natural image statistics.

What carries the argument

Controversial face pairs, which are images optimized to produce sharply contrasting predictions from different models, used to isolate diagnostic differences among representational hypotheses.

If this is right

High-level invariant representations rather than low-level image features drive alignment with human face perception.
Training on natural image statistics improves model-human agreement compared with synthetic training.
Inverse-generative objectives provide a stronger account of face perception than purely discriminative objectives on synthetic data.
Mechanisms that infer latent causes and discount nuisance factors best explain human dissimilarity judgments across pose and realism variations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The controversial-stimulus method could be extended to distinguish perceptual models in other domains such as object or scene recognition.
Artificial face-recognition systems may need to combine inverse-generative and natural-image discriminative training to approach human robustness.
Face perception may operate as part of a general visual system shaped by real-world image statistics rather than a narrowly specialized module.

Load-bearing premise

The controversial face pairs optimized to create model disagreements accurately isolate the perceptual dimensions relevant to humans without introducing optimization artifacts or biases.

What would settle it

Collect new dissimilarity ratings from fresh participants on a held-out set of controversial pairs generated with an independent optimization procedure or on unaltered real-world photographs and check whether the same models still rank highest.

read the original abstract

The perceptual representations supporting our ability to recognize faces remain a computational mystery. Deep neural networks offer mechanistic hypotheses for human face perception, but theoretically distinct models often make indistinguishable representational predictions for randomly sampled faces. To expose diagnostic differences among these hypotheses, we compared six neural network models sharing an architecture but trained on distinct tasks, using face pairs optimized to elicit contrasting model predictions ("controversial" pairs) alongside randomly sampled pairs. We tested model predictions against face-dissimilarity judgments from 864 human participants across stimulus sets differing in realism and pose variation. Models prioritizing high-level, invariant structures (trained via inverse rendering, face identification, or object classification) most robustly matched human judgments. Furthermore, models trained on natural images typically outperformed synthetic-trained counterparts. Together, these findings suggest that human face perception is shaped by mechanisms that infer latent causes of facial appearance, discount nuisance variation, and are tuned by natural image statistics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows inverse-generative and natural-image models match human face dissimilarity judgments better than alternatives when tested on controversial pairs, but the stimulus optimization step carries a real risk of artifacts.

read the letter

The main result is that models trained via inverse rendering or on natural images for face identification and object classification align more closely with human dissimilarity judgments than other variants, and this separation shows up more clearly on the controversial pairs than on random ones. The work uses the same architecture across six models but varies the training objective and data regime, then collects judgments from 864 participants on stimulus sets that differ in realism and pose variation. That setup is a clear step past routine model benchmarking because the controversial pairs are explicitly chosen to maximize predictive disagreement, which forces the models to reveal their differences on the same stimuli that humans see. The large sample and the inclusion of both synthetic and natural training conditions give the comparison some weight, and the absence of circularity is straightforward since everything rests on external behavioral data rather than fitted parameters. The pattern favoring high-level invariant representations and natural statistics is the part that holds up on the reported evidence. The soft spot is the stimulus generation itself. Optimizing pairs to drive model disagreement can produce unnatural feature combinations or near-adversarial perturbations, and it is not obvious that human judgments on those pairs tap the same dimensions as everyday face perception. The paper compares against random pairs, which helps, but the headline claims rest most heavily on the controversial set, so any artifact there would weaken the link to natural perceptual mechanisms. This is worth a referee's time because the question is concrete, the experiment is properly powered, and the design directly targets a live debate in computational vision. A serious editor should send it out, with the expectation that revisions will need to address whether the controversial stimuli remain representative.

Referee Report

2 major / 2 minor

Summary. The paper reports a behavioral experiment with 864 participants comparing human face dissimilarity judgments to six DNN models sharing an architecture but differing in training objectives (inverse rendering, face identification, object classification, and others). Using both random face pairs and 'controversial' pairs optimized to maximize divergence in model predictions, the authors find that models emphasizing high-level invariant structures—particularly those trained on natural images—best match human data across stimulus realism and pose conditions. They conclude that human face perception reflects mechanisms for inferring latent causes, discounting nuisance variation, and tuning to natural image statistics.

Significance. If the results hold after addressing stimulus concerns, the work provides a useful empirical method for distinguishing otherwise hard-to-separate computational hypotheses about face perception. The large participant sample and use of optimized stimuli to expose model differences represent a strength, offering evidence that inverse-generative and naturalistic training objectives align with human judgments more robustly than alternatives.

major comments (2)

[Methods (Stimulus Generation)] Methods (Stimulus Generation section): The optimization of controversial pairs to maximize model disagreement (via gradient-based methods on the six models) is load-bearing for the claim that invariant models match humans 'most robustly' on these pairs. Without explicit checks for optimization artifacts—such as unnatural feature combinations, adversarial perturbations, or comparisons of naturalness ratings between controversial and random pairs—the stimuli may not isolate the same perceptual dimensions as everyday face perception, weakening the evidence for the inverse-generative interpretation.
[Results (Model Comparison)] Results (Model Comparison and Statistical Analysis): The abstract claims superior performance for high-level invariant models and natural-image-trained variants, but the manuscript must report quantitative metrics (e.g., Pearson correlations, R² values, or bootstrap confidence intervals) comparing model-human alignment on controversial vs. random pairs, including effect sizes for the difference between model classes. Absence of these details in the main results or tables makes it hard to evaluate whether the distinction is robust or driven by specific pairs.

minor comments (2)

[Abstract] Abstract: The title and abstract use 'inverse-generative' without a brief definition or reference; adding a short clarification (e.g., 'training to invert a generative model of faces') would improve accessibility.
[Figures] Figure captions (throughout): Ensure all panels include error bars or participant-level variability measures, and label axes consistently with 'human dissimilarity' vs. 'model dissimilarity' for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and positive review, which highlights the potential value of our approach for distinguishing computational hypotheses in face perception. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of our methods and results.

read point-by-point responses

Referee: Methods (Stimulus Generation section): The optimization of controversial pairs to maximize model disagreement (via gradient-based methods on the six models) is load-bearing for the claim that invariant models match humans 'most robustly' on these pairs. Without explicit checks for optimization artifacts—such as unnatural feature combinations, adversarial perturbations, or comparisons of naturalness ratings between controversial and random pairs—the stimuli may not isolate the same perceptual dimensions as everyday face perception, weakening the evidence for the inverse-generative interpretation.

Authors: We agree that explicit validation of the controversial stimuli is necessary to rule out optimization artifacts and ensure they probe the same perceptual dimensions as natural face viewing. The original manuscript included qualitative examples of the optimized pairs and noted that optimization was performed within the bounds of the generative face model parameters to maintain realism. To directly address this concern, we have added a new supplementary analysis in which an independent group of 120 participants provided naturalness ratings for both controversial and random pairs (matched for pose and realism conditions). These ratings showed no significant difference (t(119) = 0.42, p = 0.67), and we include visualizations of the optimized faces to demonstrate absence of obvious unnatural feature combinations. We have also added a brief discussion of why gradient-based optimization on the shared architecture is unlikely to produce adversarial perturbations in this constrained setting. These additions appear in a revised Methods section and new Supplementary Figure S3. revision: yes
Referee: Results (Model Comparison and Statistical Analysis): The abstract claims superior performance for high-level invariant models and natural-image-trained variants, but the manuscript must report quantitative metrics (e.g., Pearson correlations, R² values, or bootstrap confidence intervals) comparing model-human alignment on controversial vs. random pairs, including effect sizes for the difference between model classes. Absence of these details in the main results or tables makes it hard to evaluate whether the distinction is robust or driven by specific pairs.

Authors: We appreciate this request for more granular quantitative reporting. While the original submission presented model-human alignment via scatter plots and average correlations in the main figures, it did not include bootstrap confidence intervals or effect sizes for the differences between model classes across pair types. In the revised manuscript we have added Table 2, which reports Pearson r, R², and bootstrap 95% CI (10,000 resamples) for each of the six models on both controversial and random pairs, separately for each realism/pose condition. We also include Cohen’s d effect sizes comparing the invariant-model class (inverse rendering, face ID, object classification) against the remaining models; the advantage for invariant models is substantially larger on controversial pairs (d = 0.82, 95% CI [0.61, 1.03]) than on random pairs (d = 0.31, 95% CI [0.12, 0.50]), with non-overlapping intervals. These metrics are now summarized in the Results section and confirm that the reported pattern is not driven by a small number of outlier pairs. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical model comparison to external human data

full rationale

The paper is an empirical study that trains six DNNs on distinct objectives, generates controversial face pairs via optimization on model disagreement, and compares model predictions to human dissimilarity judgments collected from 864 participants. No derivation chain exists that reduces a claimed prediction to its own inputs by construction. All load-bearing claims rest on independent behavioral measurements and cross-model performance differences rather than self-definition, fitted parameters renamed as predictions, or self-citation chains. The reader's assessment of score 2 is consistent with minor self-citation that is not load-bearing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that DNNs with different training objectives can serve as testable proxies for biological learning mechanisms in face perception, with no explicit free parameters or invented entities introduced beyond standard network training.

axioms (1)

domain assumption Deep neural networks with identical architectures but different training objectives can produce distinguishable predictions that map onto human perceptual representations.
Invoked when using model predictions to test hypotheses about human face perception.

pith-pipeline@v0.9.0 · 5479 in / 1246 out tokens · 30833 ms · 2026-05-14T20:25:50.303906+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

96 extracted references · 83 canonical work pages · 2 internal anchors

[1]

O’Toole, A. J. & Castillo, C. D. Face Recognition by Humans and Machines: Three Fundamental AdvancesfromDeepLearning.Annualreviewofvisionscience7,543–570(2021). https://doi.org/10.1146/ annurev-vision-093019-111701

2021
[2]

van Dyck, L. E. & Gruber, W. R. Modeling biological face recognition with deep convolutional neural networks.JournalofCognitiveNeuroscience35,1521–1537(2023). https://doi.org/10.1162/jocn_a_02040

work page doi:10.1162/jocn_a_02040 2023
[3]

& Tenenbaum, J

Yildirim, I., Belledonne, M., Freiwald, W. & Tenenbaum, J. Efficient inverse graphics in biological face processing.Science Advances6, eaax5979 (2020). https://doi.org/10.1126/sciadv.aax5979

work page doi:10.1126/sciadv.aax5979 2020
[4]

https://doi.org/10.1016/ j.patter.2021.100348

Daube, C.et al.Grounding deep neural network predictions of human categorization behavior in under- standable functional features: The case of face identity.Patterns2, 100348 (2021). https://doi.org/10.1016/ j.patter.2021.100348

work page arXiv 2021
[5]

M., Behrmann, M

Blauch, N. M., Behrmann, M. & Plaut, D. C. Computational insights into human perceptual expertise for familiarandunfamiliarfacerecognition.Cognition208,104341(2021). https://doi.org/10.1016/j.cognition. 2020.104341

work page doi:10.1016/j.cognition 2021
[6]

C., Uddenberg, S., Griffiths, T

Peterson, J. C., Uddenberg, S., Griffiths, T. L., Todorov, A. & Suchow, J. W. Deep models of superficial face judgments.Proceedings of the National Academy of Sciences119, e2115228119 (2022). https: //doi.org/10.1073/pnas.2115228119

work page doi:10.1073/pnas.2115228119 2022
[7]

Jozwik, K. M.et al.Face dissimilarity judgments are predicted by representational distance in morphable andimage-computablemodels.ProceedingsoftheNationalAcademyofSciences119,e2115047119(2022). https://doi.org/10.1073/pnas.2115047119

work page doi:10.1073/pnas.2115047119 2022
[8]

& Kanwisher, N

Dobs, K., Yuan, J., Martinez, J. & Kanwisher, N. Behavioral signatures of face perception emerge in deep neural networks optimized for face recognition.Proceedings of the National Academy of Sciences120, e2220642120 (2023). https://doi.org/10.1073/pnas.2220642120

work page doi:10.1073/pnas.2220642120 2023
[9]

Proceedings of the National Academy of Sciences120(33) (2023) https://doi.org/10.1073/pnas

Jiahui, G.et al.Modeling naturalistic face processing in humans with deep convolutional neural networks. Proceedings of the National Academy of Sciences120, e2304085120 (2023). https://doi.org/10.1073/pnas. 2304085120

work page doi:10.1073/pnas 2023
[10]

https://doi.org/10.1038/s41562-024-01816-9

Shoham,A.,Grosbard,I.,Patashnik,O.,Cohen-Or,D.&Yovel,G.Usingdeepneuralnetworkstodisentangle visualandsemanticinformationinhumanperceptionandmemory.NatureHumanBehaviour8,1–16(2024). https://doi.org/10.1038/s41562-024-01816-9

work page doi:10.1038/s41562-024-01816-9 2024
[11]

& Wolf, L

Taigman, Y., Yang, M., Ranzato, M. & Wolf, L. Deepface: Closing the gap to human-level performance in face verification.Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1701–1708 (2014). https://doi.org/10.1109/CVPR.2014.220. 20

work page doi:10.1109/cvpr.2014.220 2014
[12]

M., Vedaldi, A

Parkhi, O. M., Vedaldi, A. & Zisserman, A. Deep face recognition.Proceedings of the British Machine Vision Conference (BMVC), 41.1–41.12 (2015). https://doi.org/10.5244/C.29.41

work page doi:10.5244/c.29.41 2015
[13]

& Tang, X

Sun, Y., Wang, X. & Tang, X. Deeply learned face representations are sparse, selective, and robust. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)2892–2900 (2015). https: //doi.org/10.1109/CVPR.2015.7298907

work page doi:10.1109/cvpr.2015.7298907 2015
[14]

& Tang, X

Lu, C. & Tang, X. Surpassing human-level face verification performance on LFW with gaussian face. Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, 3811–3819 (2015). https://doi.org/10.1609/aaai.v29i1.9797

work page doi:10.1609/aaai.v29i1.9797 2015
[15]

2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 815–823 (2015)

Schroff,F.,Kalenichenko,D.&Philbin,J.Facenet:Aunifiedembeddingforfacerecognitionandclustering. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 815–823 (2015). https: //doi.org/10.1109/cvpr.2015.7298682

work page doi:10.1109/cvpr.2015.7298682 2015
[16]

doi:10.1109/CVPR.2017

Liu, W.et al.SphereFace: Deep Hypersphere Embedding for Face Recognition.2017 IEEE Conference on ComputerVisionandPatternRecognition(CVPR),6738–6746(2017). https://doi.org/10.1109/CVPR.2017

work page doi:10.1109/cvpr.2017 2017
[17]

& Zafeiriou, S

Deng, J., Guo, J., Xue, N. & Zafeiriou, S. Arcface: Additive angular margin loss for deep face recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)(2019). https://doi.org/10.1109/CVPR.2019.00482

work page doi:10.1109/cvpr.2019.00482 2019
[18]

Frontiers in Computational Neuroscience16(2022)

Tian,F.,Xie,H.,Song,Y.,Hu,S.&Liu,J.TheFaceInversionEffectinDeepConvolutionalNeuralNetworks. Frontiers in Computational Neuroscience16(2022). https://doi.org/10.3389/fncom.2022.854218

work page doi:10.3389/fncom.2022.854218 2022
[19]

& Liu, J

Xu, S., Zhang, Y., Zhen, Z. & Liu, J. The Face Module Emerged in a Deep Convolutional Neural Network Selectively Deprived of Face Experience.Frontiers in Computational Neuroscience15(2021). https: //doi.org/10.3389/fncom.2021.626259

work page doi:10.3389/fncom.2021.626259 2021
[20]

& Abudarham, N

Yovel, G., Grosbard, I. & Abudarham, N. Deep learning models challenge the prevailing assumption that face-like effects for objects of expertise support domain-general mechanisms.Proceedings of the Royal Society B: Biological Sciences290, 20230093 (2023). https://doi.org/10.1098/rspb.2023.0093

work page doi:10.1098/rspb.2023.0093 2023
[21]

Criticalfeaturesforfacerecognition.Cognition182,73–83(2019)

Abudarham,N.,Shkiller,L.&Yovel,G. Criticalfeaturesforfacerecognition.Cognition182,73–83(2019). https://doi.org/10.1016/j.cognition.2018.09.002

work page doi:10.1016/j.cognition.2018.09.002 2019
[22]

T., Katti, H

Jacob, G., Pramod, R. T., Katti, H. & Arun, S. P. Qualitative similarities and differences in visual object representations between brains and deep networks.Nature Communications12, 1872 (2021). https://doi. org/10.1038/s41467-021-22078-3

work page doi:10.1038/s41467-021-22078-3 2021
[23]

https://doi.org/10.1167/jov.25.8.2

Rosemblaum,M.etal.Concurrentemergenceofviewinvariance,sensitivitytocriticalfeatures,andidentity face classification through visual experience: Insights from deep learning algorithms.Journal of Vision25, 2 (2025). https://doi.org/10.1167/jov.25.8.2

work page doi:10.1167/jov.25.8.2 2025
[24]

Canfacerecognitionreallybedissociatedfromobjectrecognition? Journal of Cognitive Neuroscience11, 349–370 (1999)

Gauthier,I.,Behrmann,M.&Tarr,M.J. Canfacerecognitionreallybedissociatedfromobjectrecognition? Journal of Cognitive Neuroscience11, 349–370 (1999). https://doi.org/10.1162/089892999563472

work page doi:10.1162/089892999563472 1999
[25]

https://doi.org/10.1038/ s41467-019-12623-6

Grossman, S.et al.Convergent evolution of face spaces across human face-selective neuronal groups and deep convolutional networks.Nature Communications10, 4934 (2019). https://doi.org/10.1038/ s41467-019-12623-6

2019
[26]

facecells

Vinken,K.,Prince,J.S.,Konkle,T.&Livingstone,M.S. Theneuralcodefor“facecells”isnotface-specific. Science Advances9, eadg1736 (2023). https://doi.org/10.1126/sciadv.adg1736

work page doi:10.1126/sciadv.adg1736 2023
[27]

Current Biology 25, 2804–2814

Chang, L., Egger, B., Vetter, T. & Tsao, D. Y. Explaining face representation in the primate brain using different computational models.Current Biology31, 2785–2795.e4 (2021). https://doi.org/10.1016/j.cub. 2021.04.014. 21

work page doi:10.1016/j.cub 2021
[28]

Usingartificialneuralnetworkstoask‘why’questionsofmindsand brains.Trends in Neurosciences46, 240–254 (2023)

Kanwisher,N.,Khosla,M.&Dobs,K. Usingartificialneuralnetworkstoask‘why’questionsofmindsand brains.Trends in Neurosciences46, 240–254 (2023). https://doi.org/10.1016/j.tins.2022.12.008

work page doi:10.1016/j.tins.2022.12.008 2023
[29]

Carlin, J. D. & Kriegeskorte, N. Adjudicating between face-coding models with individual-face fMRI responses.PLOS Computational Biology13, 1–28 (2017). https://doi.org/10.1371/journal.pcbi.1005604

work page doi:10.1371/journal.pcbi.1005604 2017
[30]

DiverseDeepNeuralNetworksAll PredictHumanInferiorTemporalCortexWell,AfterTrainingandFitting.JournalofCognitiveNeuroscience 33, 2044–2064 (2021)

Storrs,K.R.,Kietzmann,T.C.,Walther,A.,Mehrer,J.&Kriegeskorte,N. DiverseDeepNeuralNetworksAll PredictHumanInferiorTemporalCortexWell,AfterTrainingandFitting.JournalofCognitiveNeuroscience 33, 2044–2064 (2021). https://doi.org/10.1162/jocn_a_01755

work page doi:10.1162/jocn_a_01755 2044
[31]

S., Kay, K

Conwell, C., Prince, J. S., Kay, K. N., Alvarez, G. A. & Konkle, T. A large-scale examination of inductive biases shaping high-level visual representation in brains and machines.Nature Communications15, 9383 (2024). https://doi.org/10.1038/s41467-024-53147-y

work page doi:10.1038/s41467-024-53147-y 2024
[32]

Neuron108, 413–423 (2020)

Schrimpf,M.etal.Integrativebenchmarkingtoadvanceneurallymechanisticmodelsofhumanintelligence. Neuron108, 413–423 (2020). https://doi.org/10.1016/j.neuron.2020.07.040

work page doi:10.1016/j.neuron.2020.07.040 2020
[33]

Golan, T., Raju, P. C. & Kriegeskorte, N. Controversial stimuli: Pitting neural networks against each other as models of human cognition.Proceedings of the National Academy of Sciences117, 29330–29337 (2020-11-24). https://doi.org/10.1073/pnas.1912334117

work page doi:10.1073/pnas.1912334117 2020
[34]

Golan, T., Guo, W., Schütt, H. H. & Kriegeskorte, N. Distinguishing representational geometries with controversial stimuli: Bayesian experimental design and its application to face dissimilarity judgments. SVRHM 2022 Workshop @ NeurIPS(2022). URL https://openreview.net/forum?id=a3YPu2-Mf2h

2022
[35]

& Baldassano, C

Golan, T., Siegelman, M., Kriegeskorte, N. & Baldassano, C. Testing the limits of natural language models forpredictinghumanlanguagejudgements.NatureMachineIntelligence5,952–964(2023). https://doi.org/ 10.1038/s42256-023-00718-1

work page doi:10.1038/s42256-023-00718-1 2023
[36]

1999 , isbn =

Blanz, V. & Vetter, T. A morphable model for the synthesis of 3d faces.Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’99, 187–194 (1999). https: //doi.org/10.1145/311535.311556

work page doi:10.1145/311535.311556 1999
[37]

& Vetter, T

Paysan, P., Knothe, R., Amberg, B., Romdhani, S. & Vetter, T. A 3d face model for pose and illumination invariant face recognition.Proceedings of the 6th IEEE International Conference on Advanced Video and Signal based Surveillance (AVSS) for Security, Safety and Monitoring in Smart Environments(2009). https://doi.org/10.1109/AVSS.2009.58

work page doi:10.1109/avss.2009.58 2009
[38]

https://doi.org/10.1109/fg.2018.00021

Gerig, T.et al.Morphable face models - an open framework.2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018)75–82 (2018). https://doi.org/10.1109/fg.2018.00021

work page doi:10.1109/fg.2018.00021 2018
[39]

34, 852–863 (2021)

Karras, T.et al.Alias-Free Generative Adversarial Networks.Advances in Neural Information Processing Systems, Vol. 34, 852–863 (2021). URL https://proceedings.neurips.cc/paper/2021/hash/ 076ccd93ad68be51f23707988e934906-Abstract.html

2021
[40]

A unified account of the effects of distinctiveness, inversion, and race in face recognition

Valentine, T. A unified account of the effects of distinctiveness, inversion, and race in face recognition. The Quarterly Journal of Experimental Psychology Section A43, 161–204 (1991). https://doi.org/10.1080/ 14640749108400966

1991
[41]

Inversemds:Inferringdissimilaritystructurefrommultipleitemarrangements

Kriegeskorte,N.&Mur,M. Inversemds:Inferringdissimilaritystructurefrommultipleitemarrangements. Frontiers in PsychologyVolume 3 - 2012(2012). https://doi.org/10.3389/fpsyg.2012.00245

work page doi:10.3389/fpsyg.2012.00245 2012
[42]

Very Deep Convolutional Networks for Large-Scale Image Recognition

Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations(2015). https://doi.org/10.48550/arXiv.1409.1556

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1409.1556 2015
[43]

Parkhi, and Andrew Zisserman

Cao,Q.,Shen,L.,Xie,W.,Parkhi,O.M.&Zisserman,A.VGGFace2:Adatasetforrecognisingfacesacross pose and age.2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), 67–74 (2018). https://doi.org/10.1109/FG.2018.00020

work page doi:10.1109/fg.2018.00020 2018
[44]

Kingma, D. P. & Welling, M. Auto-encoding variational Bayes.arXiv Preprint(2014). https://doi.org/10. 48550/arXiv.1312.6114. 22

work page internal anchor Pith review Pith/arXiv arXiv 2014
[45]

& Levine, S

Rybkin, O., Daniilidis, K. & Levine, S. Simple and effective VAE training with calibrated decoders.Pro- ceedings of the 38th International Conference on Machine Learning, Vol. 139 ofProceedings of Machine Learning Research, 9179–9189 (2021). URL https://proceedings.mlr.press/v139/rybkin21a.html

2021
[46]

ImageNet:

Deng, J.et al.ImageNet: A large-scale hierarchical image database.2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848

work page doi:10.1109/cvpr.2009.5206848 2009
[47]

Troje, N. F. & Bülthoff, H. H. Face recognition under varying poses: The role of texture and shape.Vision Research36, 1761–1771 (1996). https://doi.org/10.1016/0042-6989(95)00230-8

work page doi:10.1016/0042-6989(95)00230-8 1996
[48]

https://doi.org/10.1037/1076-898X.5.4.339

Bruce, V.et al.Verification of face identities from images captured on video.Journal of Experimental Psychology: Applied5, 339–360 (1999). https://doi.org/10.1037/1076-898X.5.4.339

work page doi:10.1037/1076-898x.5.4.339 1999
[49]

Variabilityinphotosofthesameface.Cognition 121, 313–323 (2011)

Jenkins,R.,White,D.,VanMontfort,X.&MikeBurton,A. Variabilityinphotosofthesameface.Cognition 121, 313–323 (2011). https://doi.org/10.1016/j.cognition.2011.08.001

work page doi:10.1016/j.cognition.2011.08.001 2011
[50]

Recognizingfaces.CurrentDirectionsinPsychologicalScience26,212–217 (2017)

Young,A.W.&Burton,A.M. Recognizingfaces.CurrentDirectionsinPsychologicalScience26,212–217 (2017). https://doi.org/10.1177/0963721416688114

work page doi:10.1177/0963721416688114 2017
[51]

J.et al.Twin Identification over Viewpoint Change: A Deep Convolutional Neural Network Surpasses Humans.ACM Trans

Parde, C. J.et al.Twin Identification over Viewpoint Change: A Deep Convolutional Neural Network Surpasses Humans.ACM Trans. Appl. Percept.20, 10:1–10:15 (2023). https://doi.org/10.1145/3609224

work page doi:10.1145/3609224 2023
[52]

Nicolás, The bar derived category of a curved dg algebra, Journal of Pure and Applied Algebra 212 (2008) 2633–2659

Zhu, X., Watson, D. M., Rogers, D. & Andrews, T. J. View-symmetric representations of faces in human and artificial neural networks.Neuropsychologia207, 109061 (2025). https://doi.org/10.1016/j. neuropsychologia.2024.109061

work page doi:10.1016/j 2025
[53]

M.et al.Dynamic presentation in 3d modulates face similarity judgments – a human-aligned encoding model approach.Sciety (eLife)(2025)

Hofmann, S. M.et al.Dynamic presentation in 3d modulates face similarity judgments – a human-aligned encoding model approach.Sciety (eLife)(2025). https://doi.org/10.31234/osf.io/f62pw_v4

work page doi:10.31234/osf.io/f62pw_v4 2025
[54]

Alcorn, M. A.et al.Strike (With) a Pose: Neural Networks Are Easily Fooled by Strange Poses of Familiar Objects.2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4840–4849 (2019). https://doi.org/10.1109/CVPR.2019.00498

work page doi:10.1109/cvpr.2019.00498 2019
[55]

Q.et al.Deep convolutional neural networks in the face of caricature.Nature Machine Intelligence 1, 522–529 (2019)

Hill, M. Q.et al.Deep convolutional neural networks in the face of caricature.Nature Machine Intelligence 1, 522–529 (2019). https://doi.org/10.1038/s42256-019-0111-7

work page doi:10.1038/s42256-019-0111-7 2019
[56]

https://doi.org/10.1073/ pnas.1403112111

Yamins,D.L.K.etal.Performance-optimizedhierarchicalmodelspredictneuralresponsesinhighervisual cortex.Proceedings of the National Academy of Sciences111, 8619–8624 (2014). https://doi.org/10.1073/ pnas.1403112111

2014
[57]

& Kriegeskorte, N

Khaligh-Razavi, S.-M. & Kriegeskorte, N. Deep Supervised, but Not Unsupervised, Models May Explain IT Cortical Representation.PLOS Computational Biology10, e1003915 (2014). https://doi.org/10.1371/ journal.pcbi.1003915

2014
[58]

& Van Gerven, M

Guclu, U. & Van Gerven, M. A. J. Deep Neural Networks Reveal a Gradient in the Complexity of Neural Representations across the Ventral Stream.Journal of Neuroscience35, 10005–10014 (2015). https: //doi.org/10.1523/JNEUROSCI.5023-14.2015

work page doi:10.1523/jneurosci.5023-14.2015 2015
[59]

URLhttps://papers.neurips.cc/paper_files/paper/ 2019/hash/7813d1590d28a7dd372ad54b5d29d033-Abstract.html

Kubilius,J.etal.Brain-LikeObjectRecognitionwithHigh-PerformingShallowRecurrentANNs.Advances inNeuralInformationProcessingSystems,Vol.32(2019). URLhttps://papers.neurips.cc/paper_files/paper/ 2019/hash/7813d1590d28a7dd372ad54b5d29d033-Abstract.html

2019
[60]

Lindsay, G. W. Convolutional Neural Networks as a Model of the Visual System: Past, Present, and Future. Journal of Cognitive Neuroscience33, 2017–2031 (2021). https://doi.org/10.1162/jocn_a_01544

work page doi:10.1162/jocn_a_01544 2017
[61]

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.,

Spoerer,C.J.,McClure,P.&Kriegeskorte,N. RecurrentConvolutionalNeuralNetworks:ABetterModelof BiologicalObjectRecognition.FrontiersinPsychology8(2017). https://doi.org/10.3389/fpsyg.2017.01551

work page doi:10.3389/fpsyg.2017.01551 2017
[62]

J., Kietzmann, T

Spoerer, C. J., Kietzmann, T. C., Mehrer, J., Charest, I. & Kriegeskorte, N. Recurrent neural networks can explain flexible trading of speed and accuracy in biological vision.PLoS Computational Biology16, e1008215 (2020). https://doi.org/10.1371/journal.pcbi.1008215. 23

work page doi:10.1371/journal.pcbi.1008215 2020
[63]

C.et al.Recurrence is required to capture the representational dynamics of the human visual system.Proceedings of the National Academy of Sciences116, 21854–21863 (2019)

Kietzmann, T. C.et al.Recurrence is required to capture the representational dynamics of the human visual system.Proceedings of the National Academy of Sciences116, 21854–21863 (2019). https://doi.org/10. 1073/pnas.1905544116

2019
[64]

Kar, K., Kubilius, J., Schmidt, K., Issa, E. B. & DiCarlo, J. J. Evidence that recurrent circuits are critical to theventralstream’sexecutionofcoreobjectrecognitionbehavior.NatureNeuroscience22,974–983(2019). https://doi.org/10.1038/s41593-019-0392-5

work page doi:10.1038/s41593-019-0392-5 2019
[65]

& Khaligh-Razavi, S.-M

Rajaei, K., Mohsenzadeh, Y., Ebrahimpour, R. & Khaligh-Razavi, S.-M. Beyond core object recognition: Recurrent processes account for object recognition under occlusion.PLOS Computational Biology15, e1007001 (2019). https://doi.org/10.1371/journal.pcbi.1007001

work page doi:10.1371/journal.pcbi.1007001 2019
[66]

https://doi.org/10.1038/s41586-026-10267-3

Shi,Y.etal.Rapidconcertedswitchingoftheneuralcodeintheinferotemporalcortex.Nature1–10(2026). https://doi.org/10.1038/s41586-026-10267-3

work page doi:10.1038/s41586-026-10267-3 2026
[67]

Muttenthaler, L., Dippel, J., Linhardt, L., Vandermeulen, R. A. & Kornblith, S. Human alignment of neural network representations.The Eleventh International Conference on Learning Representations(2023). URL https://openreview.net/forum?id=ReDQ1OUQR0X

2023
[68]

& Golan, T

Avitan, I. & Golan, T. Model–behavior alignment under flexible evaluation: When the best- fitting model isn’t the right one.Advances in Neural Information Processing Systems, Vol. 38, 12081–12120 (2025). URL https://proceedings.neurips.cc/paper_files/paper/2025/file/ 11e1900e680f5fe1893a8e27362dbe2c-Paper-Conference.pdf

2025
[69]

& Behrmann, M

Moscovitch, M., Winocur, G. & Behrmann, M. What Is Special about Face Recognition? Nineteen Exper- iments on a Person with Visual Object Agnosia and Dyslexia but Normal Face Recognition.Journal of Cognitive Neuroscience9, 555–604 (1997). https://doi.org/10.1162/jocn.1997.9.5.555

work page doi:10.1162/jocn.1997.9.5.555 1997
[70]

Plaut, D. C. & Behrmann, M. Complementary neural representations for faces and words: A computa- tionalexploration.CognitiveNeuropsychology28,251–275(2011). https://doi.org/10.1080/02643294.2011. 609812

work page doi:10.1080/02643294.2011 2011
[71]

& Dobs, K

Kar, K., Kanwisher, N. & Dobs, K. Deep neural networks optimized for both face detection and face discrimination most accurately predict face-selective neurons in macaque inferior temporal cortex.2023 Conference on Cognitive Computational Neuroscience(2023). https://doi.org/10.32470/CCN.2023.1554-0

work page doi:10.32470/ccn.2023.1554-0 2023
[72]

Grosbard, I. D. & Yovel, G. Self-supervision deep learning models are better models of human high- level visual cortex: The roles of multi-modality and dataset training size.bioRxiv Preprint(2025). https: //doi.org/10.1101/2025.01.09.632216

work page doi:10.1101/2025.01.09.632216 2025
[73]

https://doi.org/10.64898/2026

Lee,S.,Ying,J.,Dey,A.,Jeon,Y.-N.&Issa,E.B.Efficienttaskgeneralizationandhumanlikefaceperception in models that learn to discriminate face geometry.bioRxiv Preprint(2026). https://doi.org/10.64898/2026. 01.31.703048

work page doi:10.64898/2026 2026
[74]

& Chun, M

Kanwisher, N., McDermott, J. & Chun, M. M. The fusiform face area: A module in human extrastriate cortex specialized for face perception.Journal of Neuroscience17, 4302–4311 (1997). https://doi.org/10. 1523/JNEUROSCI.17-11-04302.1997

work page arXiv 1997
[75]

Freiwald, W. A. & Tsao, D. Y. Functional compartmentalization and viewpoint generalization within the macaque face-processing system.Science330, 845–851 (2010). https://doi.org/10.1126/science.1194908

work page doi:10.1126/science.1194908 2010
[76]

& Sun, J

He, K., Zhang, X., Ren, S. & Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.2015 IEEE International Conference on Computer Vision (ICCV), 1026–1034 (2015). https://doi.org/10.1109/ICCV.2015.123

work page doi:10.1109/iccv.2015.123 2015
[77]

A framework for few-shot language model evaluation

Falcon, W. & The PyTorch Lightning team. PyTorch Lightning (2019). https://doi.org/10.5281/zenodo. 3828935

work page doi:10.5281/zenodo 2019
[78]

Albumentations:fastandflexible image augmentations.Information11, 125 (2020)

Buslaev,A.,Parinov,A.,Khvedchenya,E.,Iglovikov,V.I.&Kalinin,A.A. Albumentations:fastandflexible image augmentations.Information11, 125 (2020). https://doi.org/10.3390/info11020125. 24

work page doi:10.3390/info11020125 2020
[79]

Oscar Reutersvärd

Ravi, N.et al.Accelerating 3D deep learning with PyTorch3D.arXiv Preprint(2020). https://doi.org/10. 48550/arXiv.2007.08501

work page arXiv 2020
[80]

Bayesian Experimental Design: A Review , volume =

Chaloner, K. & Verdinelli, I. Bayesian experimental design: A review.Statistical Science10, 273–304 (1995). https://doi.org/10.1214/ss/1177009939

work page doi:10.1214/ss/1177009939 1995

Showing first 80 references.