pith. machine review for the scientific record.

arxiv: 2604.05039 · v1 · submitted 2026-04-06 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

ID-Sim: An Identity-Focused Similarity Metric

Cusuh Ham, Jui-Hsien Wang, Julia Chae, Nicholas Kolkin, Richard Zhang, Sara Beery

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:25 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords identity similarity metric · human perception alignment · synthetic data augmentation · computer vision evaluation · personalized image generation · identity consistency · feed-forward network · benchmark

The pith

ID-Sim is a feed-forward metric that measures image similarity according to human selective sensitivity to identities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ID-Sim to close the gap between vision models and humans, who easily tell apart very similar identities even when viewpoint, lighting, or other context changes sharply. It constructs the metric from a high-quality training set of real-world images across many domains, then adds generative synthetic pairs that vary identity and context in controlled fine-grained steps. The result is evaluated on a new unified benchmark that checks agreement with human labels on recognition, retrieval, and generative tasks. A sympathetic reader would care because existing similarity measures do not track this human capacity, which limits reliable assessment of personalized image generation and related identity-preserving work.

Core claim

ID-Sim is a feed-forward metric designed to faithfully reflect human selective sensitivity to identities. It is trained on a curated high-quality set of images from diverse real-world domains, augmented with generative synthetic data that supplies controlled, fine-grained variations in both identity and context, and is assessed on a new unified benchmark for consistency with human annotations across identity-focused recognition, retrieval, and generative tasks.

What carries the argument

The ID-Sim feed-forward network, which outputs similarity scores trained to match human judgments on whether two images show the same identity under varying conditions, using mixed real and synthetic training pairs.
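
To make the shape of such a metric concrete, here is a minimal embed-and-compare sketch in PyTorch. This is an illustration, not the authors' implementation: the backbone, projection width, and scoring rule are assumptions (the Figure 3 caption indicates a ViT-style model with CLS and patch tokens, and Figure 25 notes that the CLS embedding is used at inference).

    # Sketch only: a generic embed-and-compare similarity metric.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class IdSimSketch(nn.Module):
        def __init__(self, backbone: nn.Module, feat_dim: int = 768, proj_dim: int = 256):
            super().__init__()
            self.backbone = backbone  # any encoder returning (B, feat_dim) features
            self.proj = nn.Linear(feat_dim, proj_dim)

        def embed(self, images: torch.Tensor) -> torch.Tensor:
            # L2-normalized projection of the backbone features.
            return F.normalize(self.proj(self.backbone(images)), dim=-1)

        def forward(self, img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
            # One feed-forward pass per image; cosine similarity in [-1, 1],
            # higher meaning "more likely the same identity".
            return (self.embed(img_a) * self.embed(img_b)).sum(dim=-1)

The practical appeal of a feed-forward metric, presumably, is inference cost: one pass per image and a dot product, with no optimization loop or generative model in the scoring path.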

If this is right

  • Provides a more reliable signal for judging whether generated images preserve a target identity across edits or style transfers.
  • Allows consistent ranking of methods on identity retrieval and recognition benchmarks that were previously hard to compare.
  • Supports direct use as an evaluation tool during development of models for personalized generation tasks.
  • Highlights where current vision models diverge from human identity distinctions, guiding targeted improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same training approach could be reused to create loss terms that directly optimize models for identity consistency during fine-tuning.
  • If the metric generalizes, it might serve as a drop-in replacement for generic perceptual losses in any pipeline that must keep specific faces or objects recognizable.
  • Extension to video or multi-view settings would test whether the learned sensitivity holds when temporal or geometric context also changes.
  • Similar synthetic-augmentation pipelines could be applied to other human-selective dimensions such as material appearance or facial expression.

Load-bearing premise

The high-quality training set spanning diverse domains plus the generative synthetic augmentations, together with the new unified benchmark, accurately capture and measure human selective sensitivity to identities without significant bias or domain gaps.

What would settle it

A fresh collection of human similarity ratings on identity pairs drawn from domains or variation types not seen in training, where ID-Sim scores show no stronger correlation with those ratings than standard metrics such as LPIPS or feature cosine similarity.
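
A sketch of how that settling experiment could be scored, assuming paired arrays of metric outputs and mean human ratings on the held-out pairs; all names below are placeholders, not artifacts of the paper:

    # Sketch only: rank-correlation of metric scores against human ratings.
    import numpy as np
    from scipy.stats import spearmanr

    def human_alignment(metric_scores: np.ndarray, human_ratings: np.ndarray) -> float:
        """Spearman rank correlation between a metric's scores and human ratings."""
        rho, _ = spearmanr(metric_scores, human_ratings)
        return rho

    # metric_scores[i] and human_ratings[i] refer to the same image pair.
    # The central claim fails this test if ID-Sim's correlation on the
    # out-of-domain pairs is no higher than LPIPS or feature cosine similarity.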

Figures

Figures reproduced from arXiv: 2604.05039 by Cusuh Ham, Jui-Hsien Wang, Julia Chae, Nicholas Kolkin, Richard Zhang, Sara Beery.

Figure 1
Figure 1: ID-Sim motivation & results. (Left) An identity-focused metric should exhibit selective sensitivity: invariant to contextual changes (e.g. background, pose, lighting), yet sensitive to subtle identity-altering changes. (Right) We present ID-Sim, which captures this property more effectively than existing metrics and achieves strong improvements across a diverse set of identity-focused tasks. view at source ↗
Figure 2
Figure 2. view at source ↗
Figure 3
Figure 3: ID-Sim training pipeline. We train our metric with dual contrastive supervision. At the global level, CLS-token projections for anchor–positive pairs are contrasted against one hard negative and additional batch negatives using InfoNCE. At the patch level, projected patch tokens are compared using Sinkhorn distance for the same instance pairs. (A sketch of the global loss follows the figure list.) view at source ↗
Figure 5
Figure 5: Newly annotated Subjects2k. We release 2k high-quality human annotations on a subset of Subjects200k, to serve as a new challenging concept-preservation evaluation benchmark. view at source ↗
Figure 4
Figure 4: Performance of ID-Sim vs. baseline models. We compare ID-Sim against standard perceptual metrics, large-scale vision foundation models, and a supervised “Universal Embedding” model (the top entry in Google’s universal embedding challenge). Across tasks (instance retrieval, concept preservation, and re-identification), ID-Sim consistently outperforms all baselines, including the instance-retrieval-focused Universal Embedding model. view at source ↗
Figure 6
Figure 6: Selective sensitivity analysis. We evaluate model sensitivity across four axes of visual change: identity, background, viewpoint, and lighting. For 100 anchor instances, we generate controlled variations and compute both sensitivity scores and similarity trends. (Top row) Compared with baseline methods, our model is notably more sensitive to identity differences while remaining stable under background, viewpoint, and lighting changes. view at source ↗
Figure 7
Figure 7: Filtered-out FORB logo category. We observe consistent appearance inconsistencies within the same ”instance” category in FORB’s ”logo” class. view at source ↗
Figure 8
Figure 8: Filtered GLDv2 categories. Many GLDv2 classes cover broad geographic areas rather than a single localized site, building, or object, making it difficult for a class to correspond to a consistent visual identity. (Filtered set sizes: MET 671, ILIAS 826, WildlifeReID-10k dogs and cats 1501, FORB 2346, GLDv2 2315, DeepFashion2 2341; validation ROC AUC 0.89.) view at source ↗
Figure 9
Figure 9: Generative contextual edited images. view at source ↗
Figure 10
Figure 10: Generative identity-edited hard negatives in training. Identity-edited images are used only as hard negatives. To avoid the model relying on generative artifacts to identify these negatives, we add mild generative noise (strength 0.1) to the anchor and positive whenever a triplet includes an identity-edited negative. This noise does not change image content but prevents artifact-based shortcuts. view at source ↗
Figure 13
Figure 13: AerialCattle2017. This dataset is composed of aerial imagery of cows on fields; the task is to retrieve the same individuals given a query image. view at source ↗
Figure 14
Figure 14: PetFace. Evaluation benchmark of 13 unseen animals. Red depicts a different individual and green depicts the same individual. view at source ↗
Figure 16
Figure 16: Subjects2k pairs. Newly annotated 2k subset of Subjects200k [80]. Green depicts the same instance, red a different one. view at source ↗
Figure 17
Figure 17: DreamBench pairs. DreamBench images are accompanied by human annotations out of 4. view at source ↗
Figure 19
Figure 19: Subjects2k annotation server task page. Example of a task page for our annotators. view at source ↗
Figure 18
Figure 18: Introduction page for the Subjects2k annotation server. We provide a clear definition of an instance to all participants prior to starting their annotations. view at source ↗
Figure 21
Figure 21: GPT-generated prompt used for standardized MLLM evaluation. view at source ↗
Figure 22
Figure 22: Qualitative results for PerSAM. We show predicted segmentation masks and corresponding predicted confidence scores, ordered from highest to lowest with respect to a reference object. When combined with PerSAM, both ID-Sim and DINOv3 are able to produce reliable segmentation mask predictions (mask drawn in red around the instance). view at source ↗
Figure 23
Figure 23: Dense masks can resolve ambiguities in multi-object scenes. Given the test image with two shirts (left), ID-Sim features are sensitive to the identity of the query image (right two images), as evidenced by the patch-level similarity heatmaps (second from left). view at source ↗
Figure 24
Figure 24: Limitations of DreamBench++ annotations. DreamBench++ assigns only two human rubric scores (0–4) per image, which leads to substantial noise in concept-preservation evaluation. As shown above, (i) images with the same DreamBench score can exhibit large variation in identity similarity, and (ii) images with high identity similarity may still receive widely different DreamBench scores. view at source ↗
Figure 25
Figure 25: Full quantitative comparison across all benchmarks. We report complete numerical results for all datasets and baselines. For ID-Sim, we show mean ± standard deviation over 10 independent training runs. All evaluations use the CLS embedding at inference, consistent with the main paper. view at source ↗
Figure 26
Figure 26: Background vs. identity variation grid. Rows vary the foreground identity through Qwen-Edit inpainting at increasing edit strengths, while columns vary the scene background using inpainting prompts. Each cell shows the similarity of the edited image to the original anchor. This grid isolates how models respond jointly to identity changes and background shifts. view at source ↗
Figure 27
Figure 27: Viewpoint variation grid. Rows vary identity strength and columns sweep natural viewpoint changes using the multi-view MVImgNet sequence. This grid evaluates how well each model maintains invariance to viewpoint while still detecting identity-altering edits. view at source ↗
Figure 28
Figure 28: Lighting variation grid. Rows correspond to increasing levels of identity change, while columns apply eight different lighting edits using Qwen-Edit. This grid tests whether models remain stable under illumination changes while remaining sensitive to small identity perturbations. view at source ↗
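
For readers who want the Figure 3 training signal in concrete form, here is a hedged sketch of the global InfoNCE term: each anchor's CLS projection is contrasted against its positive, one hard negative, and the other in-batch positives. Temperature, shapes, and the masking scheme are assumptions; the patch-level Sinkhorn term is omitted.

    # Sketch only: InfoNCE over (anchor, positive, hard negative) triplets
    # plus in-batch negatives, on L2-normalized CLS projections of shape (B, D).
    import torch
    import torch.nn.functional as F

    def global_infonce(anchor, positive, hard_negative, tau: float = 0.07):
        pos = (anchor * positive).sum(-1, keepdim=True) / tau          # (B, 1)
        hard = (anchor * hard_negative).sum(-1, keepdim=True) / tau    # (B, 1)
        batch = anchor @ positive.T / tau                              # (B, B)
        # Mask the diagonal: each row's own positive is already column 0,
        # so only the other batch positives serve as extra negatives.
        eye = torch.eye(anchor.size(0), dtype=torch.bool, device=anchor.device)
        batch = batch.masked_fill(eye, float("-inf"))
        logits = torch.cat([pos, hard, batch], dim=1)                  # (B, B + 2)
        target = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
        return F.cross_entropy(logits, target)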
read the original abstract

Humans have remarkable selective sensitivity to identities -- easily distinguishing between highly similar identities, even across significantly different contexts such as diverse viewpoints or lighting. Vision models have struggled to match this capability, and progress toward identity-focused tasks such as personalized image generation is slowed by a lack of identity-focused evaluation metrics. To help facilitate progress, we propose ID-Sim, a feed-forward metric designed to faithfully reflect human selective sensitivity. To build ID-Sim, we curate a high-quality training set of images spanning diverse real-world domains, augmented with generative synthetic data that provides controlled, fine-grained identity and contextual variations. We evaluate our metric on a new unified evaluation benchmark for assessing consistency with human annotations across identity-focused recognition, retrieval, and generative tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ID-Sim, a feed-forward similarity metric intended to capture human selective sensitivity to identities across varying contexts such as viewpoint and lighting changes. It is constructed by curating a high-quality training set of real-world images from diverse domains, augmented with generative synthetic data to enable controlled fine-grained variations in identity and context. The metric is then assessed for consistency with human annotations on a newly introduced unified benchmark covering identity-focused recognition, retrieval, and generative tasks.

Significance. If the central claim holds, ID-Sim would fill a notable gap in evaluation tools for identity-centric vision applications, especially personalized image generation where standard metrics often diverge from human identity judgments. The combination of real data with controlled synthetic augmentations offers a structured way to target selective sensitivity, though the approach's value rests on demonstrating that the resulting metric aligns with human cues rather than model-specific artifacts.

major comments (2)
  1. [Abstract] The central claim that ID-Sim 'faithfully reflect[s] human selective sensitivity' is presented without architecture details, training procedure, loss function, quantitative results, or error analysis. This absence makes it impossible to verify whether the data and method support the claim; the soundness assessment is limited to high-level motivation.
  2. [Training Set and Benchmark] The reliance on generative synthetic augmentations for controlled identity/context variations introduces the risk that ID-Sim learns non-human cues (e.g., texture inconsistencies or lighting hallucinations typical of generative models) instead of human-like identity discrimination. No ablations, controls, or analysis are described to rule out this possibility or to confirm that the benchmark annotations are free of similar generative biases and domain gaps. (One such control from the paper is sketched after the minor comments.)
minor comments (1)
  1. [Abstract] The phrase 'feed-forward metric' is introduced without definition or comparison to existing similarity measures; a brief clarification of the network structure or inference properties would aid readability.
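
On major comment 2, the Figure 10 caption already describes one concrete control: when a triplet contains an identity-edited (generated) negative, mild generative noise at strength 0.1 is added to the real anchor and positive so that generative artifacts alone cannot separate negatives from positives. A minimal sketch, assuming a hypothetical img2img_renoise helper that runs a low-strength diffusion image-to-image pass:

    # Sketch only: noise-matching so triplets with generated negatives do
    # not leak "real vs. generated" artifact cues to the metric.
    def build_triplet(anchor, positive, negative, negative_is_generated,
                      img2img_renoise, strength: float = 0.1):
        if negative_is_generated:
            # Re-noise the real images at the same low strength so all three
            # members of the triplet share generative statistics.
            anchor = img2img_renoise(anchor, strength=strength)
            positive = img2img_renoise(positive, strength=strength)
        return anchor, positive, negative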

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We provide point-by-point responses to the major comments below and describe the revisions we will implement to address the raised issues.

read point-by-point responses
  1. Referee: [Abstract] The central claim that ID-Sim 'faithfully reflect[s] human selective sensitivity' is presented without architecture details, training procedure, loss function, quantitative results, or error analysis. This absence makes it impossible to verify whether the data and method support the claim; the soundness assessment is limited to high-level motivation.

    Authors: The abstract is intentionally concise to summarize the contribution. The full manuscript provides detailed descriptions of the architecture, training procedure, loss function, quantitative results, and error analysis in the dedicated sections. To improve the abstract's informativeness and allow readers to better assess the claim upfront, we will revise it to include high-level mentions of these elements without exceeding typical length constraints. revision: yes

  2. Referee: [Training Set and Benchmark] The reliance on generative synthetic augmentations for controlled identity/context variations introduces the risk that ID-Sim learns non-human cues (e.g., texture inconsistencies or lighting hallucinations typical of generative models) instead of human-like identity discrimination. No ablations, controls, or analysis are described to rule out this possibility or to confirm that the benchmark annotations are free of similar generative biases and domain gaps.

    Authors: We recognize the importance of ruling out the possibility that ID-Sim learns artifacts from the generative synthetic data rather than human-like identity sensitivity. Our approach balances real and synthetic data to leverage the strengths of both, but we agree that additional validation is needed. In the revised version, we will add ablations that train on real data alone, comparisons of performance on real versus synthetic test images, and further analysis of the human annotations on the benchmark to assess potential biases or domain gaps. revision: yes

Circularity Check

0 steps flagged

No circularity; trained metric evaluated against independent human annotations

full rationale

The paper proposes ID-Sim as a feed-forward model trained on a high-quality image set (real domains plus generative synthetic augmentations for identity/context control) and then assessed for consistency with human annotations on a separate unified benchmark spanning recognition, retrieval, and generative tasks. No equations, derivations, or self-citations are presented in the provided text that would reduce the metric's claimed fidelity to human selective sensitivity back to its own training inputs by construction. The evaluation step is framed as an external check rather than a tautology, and the central claim remains falsifiable via the benchmark's human consistency scores. This is the expected non-finding for an empirical ML metric paper whose load-bearing content is data curation and supervised training rather than a closed mathematical chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based only on the abstract, no explicit free parameters, axioms, or invented entities are stated. The metric is described as trained on data, implying learned parameters, but none are enumerated.

pith-pipeline@v0.9.0 · 5429 in / 1111 out tokens · 50272 ms · 2026-05-10T20:25:20.867943+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

129 extracted references · 21 canonical work pages · 7 internal anchors

  1. [1]

    WildlifeReID-10k: Wildlife re-identification dataset with 10k individual animals

    Lukáš Adam, Vojtěch Čermák, Kostas Papafitsoros, and Lukas Picek. WildlifeReID-10k: Wildlife re-identification dataset with 10k individual animals. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2090–2100. IEEE, 2025.

  2. [2]

    Adobe Photoshop

    Adobe Inc. Adobe Photoshop.

  3. [3]

    Visual identification of individual Holstein-Friesian cattle via deep metric learning

    William Andrew, Jing Gao, Siobhan Mullan, Neill Campbell, Andrew W. Dowsey, and Tilo Burghardt. Visual identification of individual Holstein-Friesian cattle via deep metric learning. Computers and Electronics in Agriculture, 185:106133, 2021.

  4. [4]

    Recognition-by-components: A theory of human image understanding

    Irving Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94(2):115, 1987.

  5. [5]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.

  6. [6]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.

  7. [7]

    Deep learning for instance retrieval: A survey

    Wei Chen, Yu Liu, Weiping Wang, Erwin M Bakker, Theodoros Georgiou, Paul Fieguth, Li Liu, and Michael S Lew. Deep learning for instance retrieval: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7270–7292, 2022.

  8. [8]

    When does contrastive visual representation learning work?

    Elijah Cole, Xuan Yang, Kimberly Wilber, Oisin Mac Aodha, and Serge Belongie. When does contrastive visual representation learning work? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14755–14764, 2022.

  9. [9]

    ArcFace: Additive angular margin loss for deep face recognition

    Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.

  10. [10]

    Untangling invariant object recognition

    James J DiCarlo and David D Cox. Untangling invariant object recognition. Trends in Cognitive Sciences, 11(8):333–341, 2007.

  11. [11]

    Image quality assessment: Unifying structure and texture similarity

    Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli. Image quality assessment: Unifying structure and texture similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2567–2581, 2020.

  12. [12]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.

  13. [13]

    Mind-the-glitch: Visual correspondence for detecting inconsistencies in subject-driven generation

    Abdelrahman Eldesokey, Aleksandar Cvejic, Bernard Ghanem, and Peter Wonka. Mind-the-glitch: Visual correspondence for detecting inconsistencies in subject-driven generation, 2025.

  14. [14]

    Omni-attribute: Open-vocabulary attribute encoder for visual concept personalization

    T. S. Chen et al. Omni-attribute: Open-vocabulary attribute encoder for visual concept personalization, 2025.

  15. [15]

    Dense contrastive learning for self-supervised visual pre-training

    X. Wang et al. Dense contrastive learning for self-supervised visual pre-training, 2021.

  16. [16]

    LaSOT: A high-quality large-scale single object tracking benchmark

    Heng Fan, Hexin Bai, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Harshit, Mingzhen Huang, Juehuan Liu, Yong Xu, Chunyuan Liao, Lin Yuan, and Haibin Ling. LaSOT: A high-quality large-scale single object tracking benchmark, 2020.

  17. [17]

    Interpolating between optimal transport and MMD using Sinkhorn divergences

    Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouve, and Gabriel Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2681–2690, 2019.

  18. [18]

    DreamSim: Learning new dimensions of human visual similarity using synthetic data

    Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. DreamSim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344, 2023.

  19. [19]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.

  20. [20]

    DeepFashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images

    Yuying Ge, Ruimao Zhang, Xiaogang Wang, Xiaoou Tang, and Ping Luo. DeepFashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5337–5345, 2019.

  21. [21]

    Bootstrap your own latent: A new approach to self-supervised learning

    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.

  22. [22]

    Personalized residuals for concept-driven text-to-image generation

    Cusuh Ham, Matthew Fisher, James Hays, Nicholas Kolkin, Yuchen Liu, Richard Zhang, and Tobias Hinz. Personalized residuals for concept-driven text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8186–8195, 2024.

  23. [23]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.

  24. [24]

    Foreground-aware pyramid reconstruction for alignment-free occluded person re-identification

    Lingxiao He, Yinggang Wang, Wu Liu, He Zhao, Zhenan Sun, and Jiashi Feng. Foreground-aware pyramid reconstruction for alignment-free occluded person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8450–8459, 2019.

  25. [25]

    Conceptrol: Concept control of zero-shot personalized image generation

    Qiyuan He and Angela Yao. Conceptrol: Concept control of zero-shot personalized image generation. arXiv preprint arXiv:2503.06568, 2025.

  26. [26]

    Learning deep representations by mutual information estimation and maximization

    R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.

  27. [27]

    Image quality metrics: PSNR vs. SSIM

    Alain Hore and Djemel Ziou. Image quality metrics: PSNR vs. SSIM. In 2010 20th International Conference on Pattern Recognition, pages 2366–2369. IEEE, 2010.

  28. [28]

    LoRA: Low-rank adaptation of large language models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models, 2021.

  29. [29]

    GOT-10k: A large high-diversity benchmark for generic object tracking in the wild

    Lianghua Huang, Xin Zhao, and Kaiqi Huang. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5):1562–1577, 2021.

  30. [30]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

  31. [31]

    OpenCLIP

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, 2021.

  32. [32]

    Personalized vision via visual in-context learning

    Yuxin Jiang, Yuchao Gu, Yiren Song, Ivor Tsang, and Mike Zheng Shou. Personalized vision via visual in-context learning, 2025.

  33. [33]

    Supervised contrastive learning

    Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning, 2021.

  34. [34]

    Pose-DIVE: Pose-diversified augmentation with diffusion model for person re-identification

    Inès Hyeonsu Kim, JoungBin Lee, Woojeong Jin, Soowon Son, Kyusun Cho, Junyoung Seo, Min-Seop Kwak, Seokju Cho, JeongYeol Baek, Byeongwon Lee, et al. Pose-DIVE: Pose-diversified augmentation with diffusion model for person re-identification. arXiv preprint arXiv:2406.16042, 2024.

  35. [35]

    ILIAS: Instance-level image retrieval at scale

    Giorgos Kordopatis-Zilos, Vladan Stojnić, Anna Manko, Pavel Suma, Nikolaos-Antonios Ypsilantis, Nikos Efthymiadis, Zakaria Laskar, Jiri Matas, Ondrej Chum, and Giorgos Tolias. ILIAS: Instance-level image retrieval at scale. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14777–14787, 2025.

  36. [36]

    Are these the same apple? Comparing images based on object intrinsics

    Klemen Kotar, Stephen Tian, Hong-Xing Yu, Daniel L. K. Yamins, and Jiajun Wu. Are these the same apple? Comparing images based on object intrinsics. 2023.

  37. [37]

    ImageNet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.

  38. [38]

    Multi-concept customization of text-to-image diffusion

    Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023.

  39. [39]

    FLUX

    Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024.

  40. [40]

    SphereFace: Deep hypersphere embedding for face recognition

    Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 212–220, 2017.

  41. [41]

    Uncommon objects in 3D

    Xingchen Liu, Piyush Tayal, Jianyuan Wang, Jesus Zarzar, Tom Monnier, Konstantinos Tertikas, Jiali Duan, Antoine Toisoul, Jason Y. Zhang, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, and David Novotny. Uncommon objects in 3D. In arXiv, 2024.

  42. [42]

    Psychophysical and physiological evidence for viewer-centered object representations in the primate

    Nikos K Logothetis and Jon Pauls. Psychophysical and physiological evidence for viewer-centered object representations in the primate. Cerebral Cortex, 5(3):270–288, 1995.

  43. [43]

    A differentiable perceptual audio metric learned from just noticeable differences

    Pranay Manocha, Adam Finkelstein, Richard Zhang, Nicholas J Bryan, Gautham J Mysore, and Zeyu Jin. A differentiable perceptual audio metric learned from just noticeable differences. arXiv preprint arXiv:2001.04460, 2020.

  44. [44]

    A deep learning approach for dog face verification and recognition

    Guillaume Mougeot, Dewei Li, and Shuai Jia. A deep learning approach for dog face verification and recognition. In PRICAI 2019: Trends in Artificial Intelligence, pages 418–430, Cham, 2019. Springer International Publishing.

  45. [45]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

  46. [46]

    GPT-4V (vision): Multimodal GPT-4 with image and text input

    OpenAI. GPT-4V (vision): Multimodal GPT-4 with image and text input. https://openai.com/research/gpt-4v-system-card, 2023. Accessed: 2025-11-13.

  47. [47]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  48. [48]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, et al. DINOv2: Learning robust visual features without supervision, 2024.

  49. [49]

    Multispecies animal re-ID using a large community-curated dataset

    Lasha Otarashvili, Tamilselvan Subramanian, Jason Holmberg, JJ Levenson, and Charles V Stewart. Multispecies animal re-ID using a large community-curated dataset. arXiv preprint arXiv:2412.05602, 2024.

  50. [50]

    The role of background knowledge in speeded perceptual categorization

    Thomas J Palmeri and Celina Blalock. The role of background knowledge in speeded perceptual categorization. Cognition, 77(2):B45–B57, 2000.

  51. [51]

    Visual object understanding

    Thomas J Palmeri and Isabel Gauthier. Visual object understanding. Nature Reviews Neuroscience, 5(4):291–303, 2004.

  52. [52]

    DreamBench++: A human-aligned benchmark for personalized image generation

    Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. DreamBench++: A human-aligned benchmark for personalized image generation. arXiv preprint arXiv:2406.16855, 2024.

  53. [53]

    DreamBench++: A human-aligned benchmark for personalized image generation

    Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. DreamBench++: A human-aligned benchmark for personalized image generation. In The Thirteenth International Conference on Learning Representations, 2025.

  54. [54]

    PieAPP: Perceptual image-error assessment through pairwise preference

    Ekta Prashnani, Hong Cai, Yasamin Mostofi, and Pradeep Sen. PieAPP: Perceptual image-error assessment through pairwise preference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1808–1817, 2018.

  55. [55]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.

  56. [56]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  57. [57]

    Cognitive representations of semantic categories

    Eleanor Rosch. Cognitive representations of semantic categories. Journal of Experimental Psychology: General, 104(3):192, 1975.

  58. [58]

    Blur detection with OpenCV

    Adrian Rosebrock. Blur detection with OpenCV. https://pyimagesearch.com/2015/09/07/blur-detection-with-opencv/, 2015. Accessed: 2021-07-12.

  59. [59]

    DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.

  60. [60]

    Complex wavelet structural similarity: A new image similarity index

    Mehul P Sampat, Zhou Wang, Shalini Gupta, Alan Conrad Bovik, and Mia K Markey. Complex wavelet structural similarity: A new image similarity index. IEEE Transactions on Image Processing, 18(11):2385–2401, 2009.

  61. [61]

    Where’s Waldo: Diffusion features for personalized segmentation and retrieval

    Dvir Samuel, Rami Ben-Ari, Matan Levy, Nir Darshan, and Gal Chechik. Where’s Waldo: Diffusion features for personalized segmentation and retrieval, 2024.

  62. [62]

    GPR1200: A benchmark for general-purpose content-based image retrieval

    Konstantin Schall, Kai Uwe Barthel, Nico Hezel, and Klaus Jung. GPR1200: A benchmark for general-purpose content-based image retrieval. In MultiMedia Modeling: 28th International Conference, MMM 2022, Phu Quoc, Vietnam, June 6–10, 2022, Proceedings, Part I, pages 205–216, Berlin, Heidelberg, 2022. Springer-Verlag.

  63. [63]

    Past, present and future approaches using computer vision for animal re-identification from camera trap data

    Stefan Schneider, Graham W Taylor, Stefan Linquist, and Stefan C Kremer. Past, present and future approaches using computer vision for animal re-identification from camera trap data. Methods in Ecology and Evolution, 10(4):461–470, 2019.

  64. [64]

    Similarity learning networks for animal individual re-identification: An ecological perspective

    Stefan Schneider, Graham W Taylor, and Stefan C Kremer. Similarity learning networks for animal individual re-identification: An ecological perspective. Mammalian Biology, 102(3):899–914, 2022.

  65. [65]

    FaceNet: A unified embedding for face recognition and clustering

    Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.

  66. [66]

    Minimizing embedding distortion for robust out-of-distribution performance

    Tom Shaked, Yuval Goldman, and Oran Shayer. Minimizing embedding distortion for robust out-of-distribution performance. arXiv preprint arXiv:2409.07582, 2024.

  67. [67]

    1st solution in Google universal image embedding

    Shihao Shao and Qinghua Cui. 1st solution in Google universal image embedding. https://www.kaggle.com/datasets/louieshao/guieweights0732.

  68. [68]

    Judging the judges: A systematic study of position bias in LLM-as-a-judge

    Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush Vosoughi. Judging the judges: A systematic study of position bias in LLM-as-a-judge. arXiv preprint arXiv:2406.07791, 2024.

  69. [69]

    PetFace: A large-scale dataset and benchmark for animal identification

    Risa Shinoda and Kaede Shiohara. PetFace: A large-scale dataset and benchmark for animal identification, 2024.

  70. [70]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

  71. [71]

    DINOv3

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, et al. DINOv3, 2025.

  72. [72]

    StyleDrop: Text-to-image generation in any style

    Kihyuk Sohn, Nataniel Ruiz, Kimin Lee, Daniel Castro Chin, Irina Blok, Huiwen Chang, Jarred Barber, Lu Jiang, Glenn Entis, Yuanzhen Li, et al. StyleDrop: Text-to-image generation in any style. arXiv preprint arXiv:2306.00983, 2023.

  73. [73]

    Deep metric learning via lifted structured feature embedding

    Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding, 2015.

  74. [74]

    Generalizable person re-identification by domain-invariant mapping network

    Jifei Song, Yongxin Yang, Yi-Zhe Song, Tao Xiang, and Timothy M Hospedales. Generalizable person re-identification by domain-invariant mapping network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 719–728, 2019.

  75. [75]

    DiffSim: Taming diffusion models for evaluating visual similarity

    Yiren Song, Xiaokang Liu, and Mike Zheng Shou. DiffSim: Taming diffusion models for evaluating visual similarity. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16904–16915, 2025.

  76. [76]

    Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline)

    Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), pages 480–496, 2018.

  77. [77]

    Personalized representation from personalized generation

    Shobhita Sundaram, Julia Chae, Yonglong Tian, Sara Beery, and Phillip Isola. Personalized representation from personalized generation, 2024.

  78. [78]

    When does perceptual alignment benefit vision representations?

    Shobhita Sundaram, Stephanie Fu, Lukas Muttenthaler, Netanel Y. Tamir, Lucy Chai, Simon Kornblith, Trevor Darrell, and Phillip Isola. When does perceptual alignment benefit vision representations?, 2024.

  79. [79]

    What makes for a good stereoscopic image?

    Netanel Tamir, Shir Amir, Ranel Itzhaky, Noam Atia, Shobhita Sundaram, Stephanie Fu, Ron Sokolovsky, Phillip Isola, Tali Dekel, Richard Zhang, et al. What makes for a good stereoscopic image? In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 261–272, 2025.

  80. [80]

    OminiControl: Minimal and universal control for diffusion transformer

    Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. OminiControl: Minimal and universal control for diffusion transformer. arXiv preprint arXiv:2411.15098, 2024.

Showing first 80 references.