pith. machine review for the scientific record.

arxiv: 2603.12711 · v2 · submitted 2026-03-13 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 11:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords unsupervised cross-domain image retrieval · text prior · phase prior · CLIP · dual priors · domain adaptation · image retrieval · feature alignment

The pith

A network combining CLIP text prompts and phase features retrieves same-category images across domains without labels

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper addresses unsupervised cross-domain image retrieval by noting that pseudo-labels from clustering lack precision and that feature alignment can degrade semantics. It introduces TPSNet, which uses CLIP to generate class-specific text prompts for each domain as a text prior and extracts domain-invariant phase features as a phase prior. The two priors work in synergy to guide representation learning and cross-domain alignment. If successful, this would enable more reliable retrieval in settings where labeling data is impractical, such as medical imaging or surveillance across different cameras.

Core claim

The Text-Phase Synergy Network with Dual Priors uses CLIP to create class-specific text prompts per domain that serve as a text prior for semantic supervision, and it incorporates domain-invariant phase features as a phase prior to align domains while preserving semantic content, resulting in improved performance over methods that rely on discrete pseudo-labels.

What carries the argument

TPSNet's dual-prior mechanism: the text prior from CLIP provides precise semantic guidance, while the phase prior supplies domain-invariant information that prevents semantic degradation during cross-domain alignment.
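
To make the text prior concrete, here is a minimal sketch, in plain PyTorch, of how continuous semantic supervision from class-specific prompt embeddings could replace hard pseudo-labels. Everything here (the function name, shapes, and temperature value) is an illustrative assumption, not the paper's formulation.

    # Minimal sketch: soft class assignments from prompt similarities,
    # assuming precomputed CLIP-style embeddings (hypothetical shapes).
    import torch
    import torch.nn.functional as F

    def soft_semantic_targets(image_emb, prompt_emb, temperature=0.07):
        """image_emb: (N, D) image features; prompt_emb: (C, D) one text
        embedding per class for a given domain. Returns (N, C) soft
        assignments used as supervision instead of argmax cluster labels."""
        image_emb = F.normalize(image_emb, dim=-1)
        prompt_emb = F.normalize(prompt_emb, dim=-1)
        logits = image_emb @ prompt_emb.t() / temperature  # scaled cosine similarity
        return logits.softmax(dim=-1)

    # Toy usage: 4 images, 3 classes, 512-d embeddings.
    targets = soft_semantic_targets(torch.randn(4, 512), torch.randn(3, 512))
    print(targets.shape, targets.sum(dim=-1))  # (4, 3); each row sums to 1

The softmax over similarities keeps the supervision continuous, which is exactly the property the review credits with avoiding the brittleness of discrete pseudo-labels.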

If this is right

  • The approach replaces inaccurate discrete pseudo-labels with continuous semantic supervision from text prompts.
  • Integration of phase features reduces domain gaps without losing category information.
  • Overall retrieval accuracy increases on standard UCDIR benchmark datasets.
  • Representations become more robust for matching images from different visual domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar dual-prior strategies might apply to other unsupervised domain adaptation tasks beyond retrieval, such as classification or segmentation.
  • Exploring phase features in combination with other foundation models could yield further gains in handling domain shifts.
  • Validation on datasets with extreme domain differences, like sketch to photo, would test the limits of the phase prior's invariance.

Load-bearing premise

CLIP can generate class-specific prompts that give more accurate semantic supervision than clustering-based pseudo-labels, and phase features can be added to image representations without degrading their semantic meaning.
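
For the phase half of this premise, the sketch below shows standard two-dimensional Fourier phase extraction, the textbook operation a phase prior builds on: discarding the amplitude spectrum (a carrier of low-level, domain-specific style) while keeping the phase spectrum (a carrier of structure). The paper's actual extraction and fusion (Equations 2-4 of Section 3.2, per the rebuttal below) may differ; this function is an assumption for illustration.

    # Minimal sketch: keep only the Fourier phase of each channel.
    import torch

    def phase_features(images):
        """images: (N, C, H, W). The spectrum factors as F = |F| * exp(i * angle(F));
        reconstructing with unit amplitude keeps only the phase, i.e. the structure."""
        spectrum = torch.fft.fft2(images)                    # per-channel 2-D FFT
        phase_only = torch.exp(1j * torch.angle(spectrum))   # unit amplitude
        return torch.fft.ifft2(phase_only).real

    x = torch.randn(2, 3, 32, 32)
    print(phase_features(x).shape)  # (2, 3, 32, 32)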

What would settle it

If ablation experiments show that removing either the text prior or the phase prior does not decrease performance on the UCDIR benchmarks, the value of their synergy would be called into question; conversely, a clear drop when either prior is removed would confirm that both contribute.

Figures

Figures reproduced from arXiv: 2603.12711 by Hui Xue, Jing Yang, Pengfei Fang, Shipeng Zhu.

Figure 1. Comparison between (a) existing methods and (b) our …
Figure 2. The pipeline of TPSNet. Left: domain prompt generation via the prompt learning paradigm. Top-right: text-phase dual prior …
Figure 3. Average Accuracy (%) of UCDIR methods using (a) …
Figure 4. t-SNE visualizations of last-layer features for the base …
Figure 5. Grad-CAM visualizations of last-layer features for the …
Figure 7. Analysis on the number of learnable text tokens …
Figure 8. Analysis on coefficients α and β on two datasets.
Figure 9. Analysis on learning rate (axes: learning rate, 10⁻⁶ to 10⁻³, vs. P@1 / P@50 (%); curves: Office-Home, DomainNet).
Figure 10. Additional t-SNE and Grad-CAM visualizations of last-layer features for the baseline model (v1), baseline with text prior (v2), …
Original abstract

This paper studies unsupervised cross-domain image retrieval (UCDIR), which aims to retrieve images of the same category across different domains without relying on labeled data. Existing methods typically utilize pseudo-labels, derived from clustering algorithms, as supervisory signals for intra-domain representation learning and cross-domain feature alignment. However, these discrete pseudo-labels often fail to provide accurate and comprehensive semantic guidance. Moreover, the alignment process frequently overlooks the entanglement between domain-specific and semantic information, leading to semantic degradation in the learned representations and ultimately impairing retrieval performance. This paper addresses the limitations by proposing a Text-Phase Synergy Network with Dual Priors(TPSNet). Specifically, we first employ CLIP to generate a set of class-specific prompts per domain, termed as domain prompt, serving as a text prior that offers more precise semantic supervision. In parallel, we further introduce a phase prior, represented by domain-invariant phase features, which is integrated into the original image representations to bridge the domain distribution gaps while preserving semantic integrity. Leveraging the synergy of these dual priors, TPSNet significantly outperforms state-of-the-art methods on UCDIR benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes TPSNet, a Text-Phase Synergy Network with Dual Priors, for unsupervised cross-domain image retrieval (UCDIR). It identifies limitations in existing pseudo-label approaches from clustering, which provide inaccurate semantic guidance and cause semantic degradation during domain alignment. TPSNet introduces a text prior via CLIP-generated class-specific domain prompts for more precise supervision and a phase prior using domain-invariant phase features to bridge domain gaps while preserving semantics. The central claim is that the synergy of these dual priors enables significant outperformance over state-of-the-art methods on UCDIR benchmarks.

Significance. If the dual-prior design is shown to deliver reliable semantic supervision and domain-invariant features without degradation, the work could advance UCDIR by moving beyond discrete pseudo-labels toward multimodal and signal-based priors. The integration of CLIP text embeddings with phase features offers a potentially generalizable strategy for handling domain shifts in retrieval tasks.

major comments (3)
  1. [Abstract] The claim that CLIP-generated class-specific domain prompts provide 'more precise and comprehensive semantic supervision' than clustering-derived pseudo-labels is load-bearing for the central contribution, yet the manuscript provides no quantitative comparison (e.g., prompt-to-image similarity vs. clustering purity) or ablation isolating the text prior's contribution; this must be addressed with explicit metrics in the experimental section.
  2. [Abstract] The phase prior is asserted to integrate 'without semantic degradation' and bridge domain gaps, but the description contains no equations, extraction procedure, or fusion mechanism; without these details the claim that phase features preserve semantics while aligning distributions cannot be evaluated and is central to the synergy argument.
  3. [Abstract] The outperformance claim rests on the assumption that CLIP domain prompts remain reliable across non-natural domains (e.g., sketches, paintings), but CLIP's photo-centric training data introduces a documented bias risk; the paper must include domain-specific ablation or failure-case analysis to substantiate that the text prior outperforms pseudo-labels under this bias.
minor comments (1)
  1. [Abstract] Missing space in 'Dual Priors(TPSNet)' should be corrected to 'Dual Priors (TPSNet)'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments help clarify how to better substantiate the dual-prior claims. We address each major point below and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [Abstract] The claim that CLIP-generated class-specific domain prompts provide 'more precise and comprehensive semantic supervision' than clustering-derived pseudo-labels is load-bearing for the central contribution, yet the manuscript provides no quantitative comparison (e.g., prompt-to-image similarity vs. clustering purity) or ablation isolating the text prior's contribution; this must be addressed with explicit metrics in the experimental section.

    Authors: We agree that direct quantitative evidence strengthens the central claim. In the revised manuscript we will add an experimental subsection reporting prompt-to-image cosine similarity for the CLIP domain prompts versus clustering purity on the same features, together with an ablation that isolates the text prior by comparing full TPSNet against a variant using only the phase prior and pseudo-label supervision. A sketch of such a diagnostic follows these responses. revision: yes

  2. Referee: [Abstract] The phase prior is asserted to integrate 'without semantic degradation' and bridge domain gaps, but the description contains no equations, extraction procedure, or fusion mechanism; without these details the claim that phase features preserve semantics while aligning distributions cannot be evaluated and is central to the synergy argument.

    Authors: The full manuscript (Section 3.2) already contains the phase extraction via Fourier transform (Equation 2) and the fusion module that concatenates phase features with amplitude features before the encoder (Equations 3-4). To improve accessibility we will insert a short description of the extraction and fusion steps into the abstract and add a schematic figure in the revised version. revision: partial

  3. Referee: [Abstract] The outperformance claim rests on the assumption that CLIP domain prompts remain reliable across non-natural domains (e.g., sketches, paintings), but CLIP's photo-centric training data introduces a documented bias risk; the paper must include domain-specific ablation or failure-case analysis to substantiate that the text prior outperforms pseudo-labels under this bias.

    Authors: We acknowledge the documented CLIP bias for non-photographic domains. While our current benchmarks already contain sketch and painting subsets, we will add a dedicated ablation table and failure-case discussion that directly compares text-prior performance against pseudo-label baselines on these domains, highlighting cases where CLIP prompts degrade and how the phase prior mitigates the effect. revision: yes
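
As a hypothetical illustration of the diagnostic promised in response 1, the sketch below computes clustering purity and prompt-assignment accuracy on a small labeled probe set. All names and numbers are assumed for illustration, and ground-truth labels are used only for measurement, never for training.

    # Hypothetical diagnostic: clustering purity vs. prompt accuracy
    # on a labeled probe set (labels used for measurement only).
    import numpy as np

    def purity(cluster_ids, labels):
        """Fraction of samples whose cluster's majority label matches theirs."""
        correct = 0
        for c in np.unique(cluster_ids):
            members = labels[cluster_ids == c]
            correct += np.bincount(members).max()
        return correct / len(labels)

    def prompt_accuracy(sim, labels):
        """sim: (N, C) image-to-prompt cosine similarities; argmax per image."""
        return (sim.argmax(axis=1) == labels).mean()

    labels = np.array([0, 0, 1, 1, 2, 2])
    cluster_ids = np.array([0, 0, 1, 2, 2, 2])   # e.g. k-means output
    sim = np.random.rand(6, 3)                   # stand-in similarities
    print(purity(cluster_ids, labels), prompt_accuracy(sim, labels))

If prompt accuracy reliably exceeds clustering purity across domains, the referee's first objection is answered quantitatively.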

Circularity Check

0 steps flagged

No circularity: priors treated as external inputs

full rationale

The manuscript introduces TPSNet by adopting CLIP-generated domain prompts and phase features as independent priors supplied by external pre-trained models and standard feature extraction. These quantities are not fitted to the target retrieval metric, not defined in terms of the claimed synergy, and not justified via self-citation chains that reduce to the present result. No equations appear that would equate a prediction to a fitted parameter or rename an input as an output. The performance claims rest on benchmark comparisons rather than any self-referential derivation, so the evidential chain is checked against external benchmarks rather than against itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on two domain assumptions about CLIP prompts and phase features whose validity is not independently demonstrated in the provided abstract.

axioms (2)
  • domain assumption: CLIP can generate accurate class-specific prompts per domain that offer more precise semantic supervision than clustering-based pseudo-labels
    Invoked to justify the text prior
  • domain assumption: Domain-invariant phase features can be extracted and integrated into image representations while preserving semantic integrity
    Invoked to justify the phase prior for bridging domain gaps
invented entities (2)
  • Text prior (domain prompts from CLIP) · no independent evidence
    purpose: Provide precise semantic supervision for intra-domain learning and cross-domain alignment
    New supervisory signal introduced in the method
  • Phase prior (domain-invariant phase features) · no independent evidence
    purpose: Bridge domain distribution gaps without semantic degradation
    New alignment mechanism introduced in the method

pith-pipeline@v0.9.0 · 5500 in / 1363 out tokens · 43322 ms · 2026-05-15T11:56:43.396374+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
