pith. machine review for the scientific record.

arxiv: 2604.11539 · v1 · submitted 2026-04-13 · 💻 cs.CV · cs.AI

Recognition: unknown

CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:17 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords conditional visual similarity · vision-language models · image retrieval · embedding space modulation · no-training adaptation · multi-conditioned retrieval · CLAY-EVAL dataset

The pith

CLAY reframes pretrained vision-language embeddings as text-conditional similarity spaces for flexible image retrieval without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that visual similarity metrics can be made adaptive to user-specified text conditions by modulating the space of existing pretrained models rather than learning new ones. This would matter because it lets retrieval systems respond to multiple simultaneous interests without the heavy cost of fine-tuning or maintaining separate models. CLAY keeps visual embeddings fixed and applies text conditions only at similarity computation time. It introduces a synthetic dataset to test this under varied conditions and reports strong accuracy with better efficiency than prior approaches.

Core claim

CLAY achieves high retrieval accuracy and notable computational efficiency by reframing the embedding space of pretrained Vision-Language Models as a text-conditional similarity space without additional training. This separates textual conditioning from visual feature extraction to support multi-conditioned retrieval using fixed visual embeddings.

What carries the argument

The modulation of similarity computation in the VLM embedding space by textual conditions, which produces conditional similarities while visual embeddings remain unchanged and no fine-tuning occurs.
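
To make this mechanism concrete, here is a minimal sketch of one way a text-conditional similarity could be computed over fixed visual embeddings, loosely following the pipeline in Figure 3 (condition text features → textual subspace → projection matrix Pc → projected cosine similarity). The prompt set, the subspace rank, and the helper names (`build_condition_projection`, `conditional_similarity`) are illustrative assumptions, not the paper's actual formulation.

```python
# Minimal sketch of text-conditioned similarity over fixed visual embeddings.
# Assumptions (not from the paper): features come from a CLIP-style encoder,
# the condition subspace is the top principal directions of a few condition
# prompts' text features, and similarity is cosine after projection.
import numpy as np

def build_condition_projection(text_embeddings: np.ndarray, rank: int = 4) -> np.ndarray:
    """Build a projection matrix P_c from text features of one condition.

    text_embeddings: (m, d) array of text features for prompts describing the
    condition (e.g. several phrasings of "focus on color"). Returns a (d, d)
    orthogonal projector onto the top-`rank` principal directions.
    """
    X = text_embeddings - text_embeddings.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(X, full_matrices=False)  # rows of vt are orthonormal directions
    basis = vt[:rank]                                  # (rank, d)
    return basis.T @ basis                             # (d, d) projector

def conditional_similarity(query: np.ndarray, database: np.ndarray, P_c: np.ndarray) -> np.ndarray:
    """Cosine similarity between visual features after projection; embeddings stay fixed."""
    q = query @ P_c                                    # (d,)
    db = database @ P_c                                # (n, d)
    q = q / (np.linalg.norm(q) + 1e-8)
    db = db / (np.linalg.norm(db, axis=1, keepdims=True) + 1e-8)
    return db @ q                                      # (n,) conditional scores
```

Because the database embeddings never change, switching conditions only means building a new, small projection matrix from a handful of text features; this is where the efficiency argument summarized below would come from.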

If this is right

  • Image retrieval can incorporate multiple user conditions at once without increased model size.
  • Systems can avoid retraining or fine-tuning for new similarity criteria.
  • Computational costs drop because visual features are extracted once and reused.
  • The CLAY-EVAL dataset enables standardized testing of conditional retrieval methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Search interfaces could let users combine conditions on the fly for more personalized results.
  • Similar modulation might apply to other embedding spaces like audio or text-only models.
  • Future work could test if this holds when conditions conflict or are very specific.
  • Deployment on edge devices becomes feasible due to the efficiency gains.

Load-bearing premise

Existing pretrained vision-language models already encode enough information in their embeddings that text can selectively activate the right similarities without distorting the space.

What would settle it

A test in which, for a given text condition such as 'focus on color', the top retrieved images are compared against human judgments of similarity under that condition: if they agree with those judgments no better than random or fixed-similarity baselines, the claim fails.
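
As a hedged illustration of how that test could be scored (not a protocol taken from the paper), the sketch below measures top-k precision against human relevance judgments for a conditional scorer and a fixed-similarity baseline; the label format and the scorer signatures (`fixed_score`, `conditional_score`) are hypothetical.

```python
# Hypothetical scoring for the decisive test: does conditional retrieval agree
# with human judgments under a condition better than a fixed-similarity baseline?
import numpy as np

def precision_at_k(scores: np.ndarray, relevant: np.ndarray, k: int = 5) -> float:
    """Fraction of the top-k ranked items that human raters judged relevant."""
    topk = np.argsort(-scores)[:k]
    return float(relevant[topk].mean())

def compare_to_fixed_baseline(queries, database, relevance, fixed_score, conditional_score,
                              condition, k: int = 5):
    """relevance[i] is a 0/1 vector of human judgments for query i under `condition`."""
    fixed, cond = [], []
    for i, q in enumerate(queries):
        fixed.append(precision_at_k(fixed_score(q, database), relevance[i], k))
        cond.append(precision_at_k(conditional_score(q, database, condition), relevance[i], k))
    # If the conditional scorer is no better than the fixed baseline (or chance),
    # the core claim fails for this condition.
    return float(np.mean(fixed)), float(np.mean(cond))
```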

Figures

Figures reproduced from arXiv: 2604.11539 by Jungjoon Park, Lee Hyoseok, Sohwi Lim, Tae-Hyun Oh.

Figure 1
Figure 1. Our proposed concept-based conditional image retrieval method retrieves images focusing on the semantic aspects specified by the … view at source ↗
Figure 2
Figure 2. Illustration of the concept of CLAY. Our method adaptively computes conditional similarity between images by modulating the original similarity space into a conditional similarity space within the representation space of VLMs. view at source ↗
Figure 3
Figure 3. Conditional similarity computation pipeline. (a) Given a condition, we construct the manifold-aware textual subspace with the condition text features in advance, and generate the condition-aware projection matrix Pc. (b) At inference, we compute the conditional similarity between the query and database images by projecting the visual features onto the textual subspace with Pc. view at source ↗
Figure 4
Figure 4. Our CLAY-EVAL dataset statistics. We construct a synthetic dataset with diverse condition annotations, consisting of (a) object entity and (b) human entity. For both, the left column shows sample images demonstrating visual naturalness, and the right column visualizes the distributions of key attributes, showing diversity. Percentages are truncated to one decimal place and annotation text labels are abbreviated. view at source ↗
Figure 5
Figure 5. Qualitative comparison of our method with competing methods. For each query image and condition text pair, we compare the top-5 retrieved results from (a) CLIP-B, (b) InstructBLIP, (c) GeneCIS, and (d) our method. We also report Average Precision (AP) in each retrieval result. Green boxes indicate correctly retrieved images, while incorrect retrievals are shown in red boxes. view at source ↗
Figure 6
Figure 6. Qualitative result on Oxfordpets dataset. We visualize the top-5 retrieved results with condition dog species and location. Since no ground-truth location labels are available, we present qualitative examples only. view at source ↗
Figure 7
Figure 7. Representation space visualization with t-SNE. We report t-SNE of CLIP-B and ours (CLIP-B) on CLAY-Human under condition (a) action, (b) background, and (c) age. The features with the same label are shown in the same color for easy interpretation. Compared to the fixed representation space in CLIP-B, our method forms more discriminative spaces compliant with given conditions. view at source ↗
read the original abstract

Human perception of visual similarity is inherently adaptive and subjective, depending on the users' interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorporate multiple conditions simultaneously. To address this, we propose CLAY, an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. This design separates the textual conditioning process and visual feature extraction, allowing highly efficient and multi-conditioned retrieval with fixed visual embeddings. We also construct a synthetic evaluation dataset, CLAY-EVAL, for comprehensive assessment under diverse conditioned retrieval settings. Experiments on standard datasets and our proposed dataset show that CLAY achieves high retrieval accuracy and notable computational efficiency compared to previous works.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes CLAY, a training-free method that reframes the embedding space of pretrained Vision-Language Models as a text-conditional similarity space. Textual conditioning is separated from visual feature extraction, with visual embeddings kept fixed to enable efficient multi-conditioned image retrieval. The authors introduce the synthetic CLAY-EVAL dataset and report that experiments on standard datasets plus CLAY-EVAL demonstrate high retrieval accuracy and computational efficiency relative to prior work.

Significance. If the central claims hold, the approach would enable flexible, user-specified conditional retrieval without model updates or fine-tuning, offering clear efficiency advantages for practical systems. The fixed-embedding design and introduction of CLAY-EVAL for diverse conditioned settings are positive contributions. Significance is limited by the untested premise that pretrained VLM spaces already encode the structures needed for arbitrary conditions.

major comments (2)
  1. [Abstract and §4, Experiments] The claims of high retrieval accuracy and notable efficiency are stated without quantitative numbers, error bars, baseline tables, or ablation results in the abstract, and are only summarized at a high level in the experiments section; this absence directly undermines evaluation of the central no-training claim.
  2. [§3, Method] The modulation step that produces conditional similarity from fixed visual embeddings is presented as sufficient for arbitrary text conditions, yet no analysis, failure-case study, or out-of-distribution test is supplied to show that the pretrained space contains the required non-linear feature re-weightings; this assumption is load-bearing for the entire training-free result.
minor comments (2)
  1. [§3.2] The description of how multiple conditions are combined could be made more precise with an explicit formula or pseudocode in the method section; one possible form is sketched after this list.
  2. [§4.1] CLAY-EVAL construction details (prompt templates, condition diversity metrics) should be expanded in the supplementary material or a dedicated subsection to allow reproduction.
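
For illustration only: one plausible way to combine several conditions while keeping visual embeddings fixed is to build a projection per condition and aggregate the per-condition cosine scores. The weighted-average scheme and function name below are assumptions, not the paper's stated combination rule.

```python
# Hypothetical multi-condition aggregation over fixed visual embeddings.
# Each condition contributes its own projection matrix P_c; scores are combined
# as a weighted average. This is an illustrative guess, not the paper's rule.
import numpy as np

def multi_condition_similarity(query, database, projections, weights=None):
    """query: (d,), database: (n, d), projections: list of (d, d) matrices P_c."""
    if weights is None:
        weights = [1.0 / len(projections)] * len(projections)
    scores = np.zeros(database.shape[0])
    for w, P_c in zip(weights, projections):
        q = query @ P_c
        db = database @ P_c
        q = q / (np.linalg.norm(q) + 1e-8)
        db = db / (np.linalg.norm(db, axis=1, keepdims=True) + 1e-8)
        scores += w * (db @ q)            # weighted sum of per-condition cosines
    return scores                          # (n,) combined conditional similarity
```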

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comments point by point below and outline the revisions we plan to make to improve the clarity and substantiation of our claims.

read point-by-point responses
  1. Referee: [Abstract and §4, Experiments] The claims of high retrieval accuracy and notable efficiency are stated without quantitative numbers, error bars, baseline tables, or ablation results in the abstract, and are only summarized at a high level in the experiments section; this absence directly undermines evaluation of the central no-training claim.

    Authors: We agree that the abstract and experiments section would benefit from more specific quantitative details to support the claims. In the revised version, we will update the abstract to include key numerical results, such as the top-1 retrieval accuracy on CLAY-EVAL and efficiency comparisons (e.g., inference time reductions). We will also expand §4 to present full baseline tables, ablation studies on the modulation components, and error bars from multiple experimental runs, providing a more rigorous evaluation of the training-free method. revision: yes

  2. Referee: [§3, Method] The modulation step that produces conditional similarity from fixed visual embeddings is presented as sufficient for arbitrary text conditions, yet no analysis, failure-case study, or out-of-distribution test is supplied to show that the pretrained space contains the required non-linear feature re-weightings; this assumption is load-bearing for the entire training-free result.

    Authors: This is a valid point regarding the foundational assumption of our approach. While the empirical results across multiple datasets, including the diverse conditioned scenarios in CLAY-EVAL, demonstrate the effectiveness of the modulation in practice, we recognize the value of additional analysis. We will add to the revised manuscript a discussion in §3 on the properties of the pretrained embedding space that enable the conditional similarity, along with selected failure cases and tests on out-of-distribution conditions to better illustrate the limits and capabilities of the method without requiring training. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's core proposal is a methodological reframing of pretrained VLM embedding spaces into a text-conditional similarity space via separation of conditioning and fixed visual feature extraction, with no additional training. The abstract and description contain no equations, fitted parameters, or predictions that reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and no known results are merely renamed. The approach relies on external pretrained models as independent inputs, making the derivation non-circular and self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no explicit free parameters, axioms, or invented entities beyond standard use of pretrained VLMs and a new synthetic evaluation dataset.

pith-pipeline@v0.9.0 · 5434 in / 1025 out tokens · 65629 ms · 2026-05-10T16:17:07.345608+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

59 extracted references · 8 canonical work pages · 3 internal anchors

  1. [1]

    Analyzing clip’s performance limitations in multi-object scenarios: A controlled high-resolution study

    Reza Abbasi, Ali Nazari, Aminreza Sefid, Mohammadali Banayeeanzade, Mohammad Hossein Rohban, and Mahdieh Soleymani Baghshah. Analyzing clip’s performance limitations in multi-object scenarios: A controlled high-resolution study. arXiv preprint arXiv:2502.19828, 2025.

  2. [2]

    Zero-shot composed image retrieval with textual inversion

    Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, and Alberto Del Bimbo. Zero-shot composed image retrieval with textual inversion. In ICCV, 2023.

  3. [3]

    Beyond the highlights: Video retrieval with salient and surrounding contexts

    Jaehun Bang, Moon Ye-Bin, Tae-Hyun Oh, and Kyungdon Joo. Beyond the highlights: Video retrieval with salient and surrounding contexts. In WACV, 2026.

  4. [4]

    Not only text: Exploring compositionality of visual representations in vision-language models

    Davide Berasi, Matteo Farina, Massimiliano Mancini, Elisa Ricci, and Nicola Strisciuglio. Not only text: Exploring compositionality of visual representations in vision-language models. In CVPR, 2025.

  5. [5]

    Black Forest Labs. FLUX.1-dev. https://huggingface.co/black-forest-labs/FLUX.1-dev, 2024.

  6. [6]

    Food-101–mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In ECCV, 2014.

  7. [7]

    Unifying deep local and global features for image search

    Bingyi Cao, Andre Araujo, and Jack Sim. Unifying deep local and global features for image search. In ECCV, 2020.

  8. [8]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV.

  9. [9]

    Patchwise Retrieval: A bag of practical techniques for instance-level matching

    Wonseok Choi, Sohwi Lim, Nam Hyeon-Woo, Moon Ye-Bin, Dong-Ju Jeong, Jinyoung Hwang, and Tae-Hyun Oh. Patchwise Retrieval: A bag of practical techniques for instance-level matching. In WACV, 2026.

  10. [10]

    Instructblip: Towards general-purpose vision- language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023.

  11. [11]

    Histograms of oriented gradi- ents for human detection

    Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.

  12. [12]

    VSC: Visual search compositional text-to-image diffusion model

    Do Huu Dat, Nam Hyeon-Woo, Po-Yuan Mao, and Tae-Hyun Oh. VSC: Visual search compositional text-to-image diffusion model. In ICCV, 2025.

  13. [13]

    Ip-composer: Semantic composition of visual concepts

    Sara Dorfman, Dana Cohen-Bar, Rinon Gal, and Daniel Cohen-Or. Ip-composer: Semantic composition of visual concepts. In ACM Transactions on Graphics (SIGGRAPH), pages 1–11, 2025.

  14. [14]

    Mitigate the gap: Improving cross-modal alignment in CLIP

    Sedigheh Eslami and Gerard de Melo. Mitigate the gap: Improving cross-modal alignment in CLIP. In ICLR, 2025.

  15. [15]

    Principal geodesic analysis for the study of nonlinear statistics of shape

    P.T. Fletcher, Conglin Lu, S.M. Pizer, and Sarang Joshi. Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Transactions on Medical Imaging, 23(8):995–1005, 2004.

  16. [16]

    Fair diffusion: Instructing text-to-image generation models on fairness

    Fabian Friedrich, Moritz Brack, Lukas Struppek, David Hintersdorf, Patrick Schramowski, Sasha Luccioni, and Kristian Kersting. Fair diffusion: Instructing text-to-image generation models on fairness. arXiv preprint arXiv:2302.10893, 2023.

  17. [17]

    Dreamsim: Learning new dimensions of human visual similarity using synthetic data

    Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. In NeurIPS, 2023.

  18. [18]

    Deep image retrieval: Learning global representations for image search

    Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. Deep image retrieval: Learning global representations for image search. In ECCV, 2016.

  19. [19]

    End-to-end learning of deep visual representations for image retrieval

    Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. End-to-end learning of deep visual representations for image retrieval. IJCV, 124(2):237–254, 2017.

  20. [20]

    Language-only training of zero-shot composed image retrieval

    Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, and Sangdoo Yun. Language-only training of zero-shot composed image retrieval. In CVPR, 2024.

  21. [21]

    Directional statistics with the spherical normal distribution

    Søren Hauberg. Directional statistics with the spherical normal distribution. In 2018 21st International Conference on Information Fusion (FUSION), 2018.

  22. [22]

    Scene completion using millions of photographs

    James Hays and Alexei A Efros. Scene completion using millions of photographs. ACM Transactions on Graphics (TOG), 26(3):4–es, 2007.

  23. [23]

    Focallens: Instruction tuning enables zero-shot conditional image representations

    Cheng-Yu Hsieh, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Chun-Liang Li, Ranjay Krishna, Oncel Tuzel, and Hadi Pouransari. Focallens: Instruction tuning enables zero-shot conditional image representations. arXiv preprint arXiv:2504.08368, 2025.

  24. [24]

    Spherical linear interpolation and text-anchoring for zero-shot composed image retrieval

    Young Kyun Jang, Dat Huynh, Ashish Shah, Wen-Kai Chen, and Ser-Nam Lim. Spherical linear interpolation and text-anchoring for zero-shot composed image retrieval. In ECCV.

  25. [25]

    Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

    Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.

  26. [26]

    Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation

    K. Karkkainen and J. Joo. Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. In WACV, 2021.

  27. [27]

    Vision-by-language for training-free compositional image retrieval

    Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, and Zeynep Akata. Vision-by-language for training-free compositional image retrieval. In ICLR, 2024.

  28. [28]

    Finer-Personalization Rank: Fine-grained retrieval examines identity preservation for personalized generation

    Connor Kilrain, David Carlyn, Julia Chae, Sara Beery, Wei-Lun Chao, and Jianyang Gu. Finer-Personalization Rank: Fine-grained retrieval examines identity preservation for personalized generation. arXiv preprint arXiv:2512.19026, 2025.

  29. [29]

    meol: Training-free instruction-guided multimodal embedder for vector graphics and image retrieval

    Kyeong Seon Kim, Baek Seong-Eun, Lee Jung-Mok, and Tae-Hyun Oh. meol: Training-free instruction-guided multimodal embedder for vector graphics and image retrieval. In WACV.

  30. [30]

    3d object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCVW, 2013.

  31. [31]

    Image clustering conditioned on text criteria

    Sehyun Kwon, Jaeseung Park, Minkyu Kim, Jaewoong Cho, Ernest K. Ryu, and Kangwook Lee. Image clustering conditioned on text criteria. In ICLR, 2024.

  32. [32]

    Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

    Mingxin Li, Yanzhao Zhang, Dingkun Long, Chen Keqin, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A unified framework for state-of-the-art multimodal retrieval and ranking. arXiv preprint arXiv:2601.04720, 2026.

  33. [33]

    OmniPrism: Learning Disentangled Visual Concept for Image Generation

    Yangyang Li, Daqing Liu, Wu Liu, Allen He, Xinchen Liu, Yongdong Zhang, and Guoqing Jin. OmniPrism: Learning disentangled visual concept for image generation. arXiv preprint arXiv:2412.12242, 2024.

  34. [34]

    Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning

    Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. In NeurIPS, 2022.

  35. [35]

    Image retrieval on real-life images with pre-trained vision-and-language models

    Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. In ICCV, 2021.

  36. [36]

    Distinctive image features from scale-invariant keypoints

    David G Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.

  37. [37]

    Fine-Grained Visual Classification of Aircraft

    Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

  38. [38]

    Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents

    Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, and Semih Yavuz. Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents. arXiv preprint arXiv:2507.04590, 2025.

  39. [39]

    Csd-var: Content-style decomposition in visual autoregressive models

    Quang-Binh Nguyen, Minh Luu, Quang Nguyen, Anh Tran, and Khoi Nguyen. Csd-var: Content-style decomposition in visual autoregressive models. In ICCV, 2025.

  40. [40]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing, 2008.

  41. [41]

    Large-scale image retrieval with attentive deep local features

    Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. Large-scale image retrieval with attentive deep local features. In ICCV, 2017.

  42. [42]

    Learning and transferring mid-level image representations using convolutional neural networks

    Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, 2014.

  43. [43]

    Cats and dogs

    Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In CVPR, 2012.

  44. [44]

    Revisiting oxford and paris: Large-scale image retrieval benchmarking

    Filip Radenović, Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum. Revisiting oxford and paris: Large-scale image retrieval benchmarking. In CVPR, 2018.

  45. [45]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

  46. [46]

    Learning with average precision: Training image retrieval with a listwise loss

    Jerome Revaud, Jon Almazán, Rafael S Rezende, and Cesar Roberto de Souza. Learning with average precision: Training image retrieval with a listwise loss. In ICCV, 2019.

  47. [47]

    Improving personalized search with regularized low-rank parameter updates

    Fiona Ryan, Josef Sivic, Fabian Caba Heilbron, Judy Hoffman, James M Rehg, and Bryan Russell. Improving personalized search with regularized low-rank parameter updates. In CVPR, 2025.

  48. [48]

    Pic2word: Mapping pictures to words for zero-shot composed image retrieval

    Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, and Tomas Pfister. Pic2word: Mapping pictures to words for zero-shot composed image retrieval. In CVPR, 2023.

  49. [49]

    On the rankability of visual embeddings

    Ankit Sonthalia, Arnas Uselis, and Seong Joon Oh. On the rankability of visual embeddings. In NeurIPS, 2025.

  50. [50]

    Genecis: A benchmark for general conditional image similarity

    Sagar Vaze, Nicolas Carion, and Ishan Misra. Genecis: A benchmark for general conditional image similarity. In CVPR.

  51. [51]

    Conditional similarity networks

    Andreas Veit, Serge Belongie, and Theofanis Karaletsos. Conditional similarity networks. In CVPR, 2017.

  52. [52]

    Locality-constrained linear coding for image classification

    Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang, and Yihong Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.

  53. [53]

    Cross-modal retrieval with cnn visual features: A new baseline

    Yunchao Wei, Yao Zhao, Canyi Lu, Shikui Wei, Luoqi Liu, Zhenfeng Zhu, and Shuicheng Yan. Cross-modal retrieval with cnn visual features: A new baseline. IEEE Transactions on Cybernetics, 47(2):449–460, 2016.

  54. [54]

    The fashion iq dataset: Retrieving images by combining side information and relative natural language feedback

    Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. The fashion iq dataset: Retrieving images by combining side information and relative natural language feedback. In CVPR, 2021.

  55. [55]

    Grouplet: A structured image representation for recognizing human and object interactions

    Bangpeng Yao and Li Fei-Fei. Grouplet: A structured image representation for recognizing human and object interactions. In CVPR, 2010.

  56. [56]

    Human action recognition by learning bases of action attributes and parts

    Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas Guibas, and Li Fei-Fei. Human action recognition by learning bases of action attributes and parts. In ICCV.

  57. [57]

    TextManiA: Enriching visual feature by text-driven manifold augmentation

    Moon Ye-Bin, Jisoo Kim, Hongyeob Kim, Kilho Son, and Tae-Hyun Oh. TextManiA: Enriching visual feature by text-driven manifold augmentation. In ICCV, 2023.

  58. [58]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023.

  59. [59]

    Magiclens: Self-supervised image retrieval with open-ended instructions

    Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, and Ming-Wei Chang. Magiclens: Self-supervised image retrieval with open-ended instructions. In ICML, 2024.