CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space
Pith reviewed 2026-05-10 16:17 UTC · model grok-4.3
The pith
CLAY reframes pretrained vision-language embeddings as text-conditional similarity spaces for flexible image retrieval without retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CLAY achieves high retrieval accuracy and notable computational efficiency by reframing the embedding space of pretrained Vision-Language Models as a text-conditional similarity space without additional training. This separates textual conditioning from visual feature extraction to support multi-conditioned retrieval using fixed visual embeddings.
What carries the argument
Textual conditions modulate the similarity computation in the VLM embedding space, producing conditional similarities while the visual embeddings stay fixed and no fine-tuning occurs.
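A minimal numpy sketch of this idea, assuming precomputed, unit-normalized CLIP-style embeddings. The dimension re-weighting rule below is an illustrative guess at what "modulation" could look like, not the paper's actual rule.

    import numpy as np

    def conditional_similarity(query, gallery, cond, temperature=0.1):
        # Derive per-dimension weights from the condition's coordinate
        # magnitudes (our assumption, not the paper's exact modulation).
        w = np.exp(np.abs(cond) / temperature)
        w = w / w.sum()
        # Weighted inner product over the fixed, precomputed visual embeddings.
        return (gallery * w) @ (query * w)

    # Toy usage with random stand-ins for precomputed VLM features.
    rng = np.random.default_rng(0)
    d, n = 512, 1000
    gallery = rng.normal(size=(n, d))
    gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
    query = rng.normal(size=d); query /= np.linalg.norm(query)
    cond = rng.normal(size=d); cond /= np.linalg.norm(cond)
    top5 = np.argsort(-conditional_similarity(query, gallery, cond))[:5]

The key property is that gallery never changes: only the cheap, text-derived weights vary from condition to condition.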
If this is right
- Image retrieval can incorporate multiple user conditions at once without increased model size.
- Systems can avoid retraining or fine-tuning for new similarity criteria.
- Computational costs drop because visual features are extracted once and reused (see the sketch after this list).
- The CLAY-EVAL dataset enables standardized testing of conditional retrieval methods.
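A sketch of the extract-once pattern behind the efficiency point above, reusing conditional_similarity and the toy tensors from the earlier sketch. The index class and its method names are our framing, not the paper's API.

    class ConditionedIndex:
        """Extract gallery embeddings once; apply any number of conditions later."""
        def __init__(self, gallery_embeddings):
            self.gallery = gallery_embeddings  # (n, d), computed once, never updated

        def retrieve(self, query_vec, cond_vec, k=5):
            scores = conditional_similarity(query_vec, self.gallery, cond_vec)
            return np.argsort(-scores)[:k]

    # The same cached gallery serves every condition; only the text side changes.
    index = ConditionedIndex(gallery)
    cond2 = rng.normal(size=d); cond2 /= np.linalg.norm(cond2)
    hits_a = index.retrieve(query, cond)   # e.g. "focus on color"
    hits_b = index.retrieve(query, cond2)  # a second, different condition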
Where Pith is reading between the lines
- Search interfaces could let users combine conditions on the fly for more personalized results.
- Similar modulation might apply to other embedding spaces like audio or text-only models.
- Future work could test if this holds when conditions conflict or are very specific.
- Deployment on edge devices becomes feasible due to the efficiency gains.
Load-bearing premise
Existing pretrained vision-language models already encode enough information in their embeddings that text can selectively activate the right similarities without distorting the space.
What would settle it
For a given text condition such as 'focus on color', compare the top retrieved images against human similarity judgments under that condition; if agreement is no better than random chance or a fixed-similarity baseline, the core claim fails.
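A minimal scoring sketch for such a test, assuming human-labeled triplets (query, preferred, other, condition) and reusing conditional_similarity from the sketch above; all names are illustrative, and the judgment data would come from human annotation.

    def triplet_agreement(judgments, score_fn):
        # judgments: iterable of (query, preferred, other, cond) unit-norm vectors,
        # where humans judged `preferred` more similar to `query` under `cond`.
        hits = 0
        for q, pos, neg, c in judgments:
            s = score_fn(q, np.stack([pos, neg]), c)
            hits += int(s[0] > s[1])
        return hits / len(judgments)

    # Baselines to beat: chance (0.5) and a condition-blind cosine similarity.
    fixed_cosine = lambda q, g, c: g @ q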
Original abstract
Human perception of visual similarity is inherently adaptive and subjective, depending on the users' interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorporate multiple conditions simultaneously. To address this, we propose CLAY, an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. This design separates the textual conditioning process and visual feature extraction, allowing highly efficient and multi-conditioned retrieval with fixed visual embeddings. We also construct a synthetic evaluation dataset, CLAY-EVAL, for comprehensive assessment under diverse conditioned retrieval settings. Experiments on standard datasets and our proposed dataset show that CLAY achieves high retrieval accuracy and notable computational efficiency compared to previous works.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes CLAY, a training-free method that reframes the embedding space of pretrained Vision-Language Models as a text-conditional similarity space. Textual conditioning is separated from visual feature extraction, with visual embeddings kept fixed to enable efficient multi-conditioned image retrieval. The authors introduce the synthetic CLAY-EVAL dataset and report that experiments on standard datasets plus CLAY-EVAL demonstrate high retrieval accuracy and computational efficiency relative to prior work.
Significance. If the central claims hold, the approach would enable flexible, user-specified conditional retrieval without model updates or fine-tuning, offering clear efficiency advantages for practical systems. The fixed-embedding design and introduction of CLAY-EVAL for diverse conditioned settings are positive contributions. Significance is limited by the untested premise that pretrained VLM spaces already encode the structures needed for arbitrary conditions.
Major comments (2)
- [Abstract and §4 (Experiments)] The claims of high retrieval accuracy and notable efficiency are stated without quantitative numbers, error bars, baseline tables, or ablation results: the abstract gives no figures, and the experiments section summarizes only at a high level. This absence directly undermines evaluation of the central no-training claim.
- [§3 (Method)] The modulation step that produces conditional similarity from fixed visual embeddings is presented as sufficient for arbitrary text conditions, yet no analysis, failure-case study, or out-of-distribution test shows that the pretrained space contains the required non-linear feature re-weightings. This assumption is load-bearing for the entire training-free result.
Minor comments (2)
- [§3.2] The description of how multiple conditions are combined could be made more precise with an explicit formula or pseudocode in the method section (one plausible form is sketched after this list).
- [§4.1] CLAY-EVAL construction details (prompt templates, condition diversity metrics) should be expanded in the supplementary material or a dedicated subsection to allow reproduction.
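For concreteness, one plausible score-level fusion for multiple conditions (our assumption, not the paper's definition) is a convex combination of per-condition similarities:

    s(x, q \mid c_1, \dots, c_K) = \sum_{k=1}^{K} \alpha_k \,\mathrm{sim}_{c_k}(x, q), \qquad \alpha_k \ge 0, \; \sum_{k=1}^{K} \alpha_k = 1,

where sim_{c_k} is the similarity modulated by condition c_k over the fixed visual embeddings, and the weights alpha_k encode the user's relative emphasis on each condition.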
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address the major comments point by point below and outline the revisions we plan to make to improve the clarity and substantiation of our claims.
Point-by-point responses
- Referee: [Abstract and §4 (Experiments)] The claims of high retrieval accuracy and notable efficiency are stated without quantitative numbers, error bars, baseline tables, or ablation results: the abstract gives no figures, and the experiments section summarizes only at a high level. This absence directly undermines evaluation of the central no-training claim.
Authors: We agree that the abstract and experiments section would benefit from more specific quantitative details to support the claims. In the revised version, we will update the abstract to include key numerical results, such as top-1 retrieval accuracy on CLAY-EVAL and efficiency comparisons (e.g., inference-time reductions). We will also expand §4 with full baseline tables, ablation studies on the modulation components, and error bars from multiple experimental runs, providing a more rigorous evaluation of the training-free method. Revision: yes.
- Referee: [§3 (Method)] The modulation step that produces conditional similarity from fixed visual embeddings is presented as sufficient for arbitrary text conditions, yet no analysis, failure-case study, or out-of-distribution test shows that the pretrained space contains the required non-linear feature re-weightings. This assumption is load-bearing for the entire training-free result.
Authors: This is a valid point regarding the foundational assumption of our approach. While the empirical results across multiple datasets, including the diverse conditioned scenarios in CLAY-EVAL, demonstrate the effectiveness of the modulation in practice, we recognize the value of additional analysis. In the revised manuscript we will add a discussion in §3 of the properties of the pretrained embedding space that enable conditional similarity, along with selected failure cases and tests on out-of-distribution conditions, to better illustrate the limits and capabilities of the method without requiring training. Revision: yes.
Circularity Check
No significant circularity; derivation is self-contained
Full rationale
The paper's core proposal is a methodological reframing of pretrained VLM embedding spaces into a text-conditional similarity space via separation of conditioning and fixed visual feature extraction, with no additional training. The abstract and description contain no equations, fitted parameters, or predictions that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and no known results are merely renamed. The approach relies on external pretrained models as independent inputs, making the derivation non-circular and self-contained against external benchmarks.