pith. machine review for the scientific record.

arxiv: 2604.05583 · v1 · submitted 2026-04-07 · 💻 cs.CV

Recognition: no theorem link

WRF4CIR: Weight-Regularized Fine-Tuning Network for Composed Image Retrieval

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords composed image retrieval · fine-tuning · overfitting · vision-language pre-trained models · weight regularization · adversarial perturbations · generalization gap

The pith

Adversarial perturbations applied to model weights opposite the gradient-descent step during fine-tuning reduce overfitting in composed image retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current composed image retrieval methods fine-tune vision-language pre-trained models on reference images plus modification texts, but they overfit badly when only limited training triplets are available. The paper documents a large and consistent generalization gap between training and test performance across several models and datasets. WRF4CIR counters the gap with weight regularization: during each fine-tuning step it generates small adversarial perturbations opposite to the gradient-descent update (that is, along the gradient, so the training loss rises) and adds them to the weights. This forces the model to work harder to fit the training data, which the experiments show improves retrieval accuracy on standard benchmarks. A sympathetic reader would care because CIR is a practical search task and the regularization requires no extra data or architectural changes.

Core claim

During fine-tuning of vision-language pre-trained models for composed image retrieval, generating adversarial perturbations to the model weights in the exact opposite direction of the gradient descent step acts as an effective regularizer. This increases the difficulty of fitting the limited training triplets and thereby narrows the previously overlooked generalization gap between training and test performance.

What carries the argument

Weight-regularized fine-tuning via adversarial perturbations generated opposite the gradient-descent direction; this serves as the sole added regularizer inside the standard fine-tuning loop.
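
To make the mechanism concrete, here is a minimal sketch of one weight-regularized fine-tuning step in PyTorch, under stated assumptions: the perturbation is the L2-normalized gradient scaled by a strength epsilon and is applied to every parameter, and loss_fn stands in for whatever retrieval loss the pipeline already uses. The paper's exact magnitude, schedule, and choice of perturbed layers are not given above, so treat this as an illustration, not the authors' implementation.

    import torch

    def wrf_step(model, loss_fn, batch, optimizer, epsilon=1e-3):
        # Pass 1: gradient of the retrieval loss at the current weights.
        optimizer.zero_grad()
        loss_fn(model, batch).backward()

        # Perturb each weight along its own gradient (the ascent direction,
        # i.e. opposite to the gradient-descent step) and remember the deltas.
        deltas = []
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is None:
                    continue
                d = epsilon * p.grad / (p.grad.norm() + 1e-12)
                p.add_(d)
                deltas.append((p, d))

        # Pass 2: gradient at the perturbed weights drives the actual update.
        optimizer.zero_grad()
        loss_fn(model, batch).backward()

        # Undo the perturbation, then step the optimizer on the clean weights.
        with torch.no_grad():
            for p, d in deltas:
                p.sub_(d)
        optimizer.step()

The two forward-backward passes match the description in Figure 3, and the structure resembles adversarial weight perturbation and sharpness-aware minimization, both of which also ascend the loss before the update.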

If this is right

  • The method narrows the generalization gap across multiple vision-language backbones and multiple CIR datasets.
  • It delivers measurable gains in retrieval metrics while using the same limited triplet supervision as prior approaches.
  • No additional data collection or model architecture changes are needed beyond the perturbation step.
  • The regularization can be inserted into any existing fine-tuning pipeline for CIR.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same opposite-gradient perturbation idea might be tested as a drop-in regularizer for other vision-language tasks that suffer from small labeled sets.
  • It is worth checking whether the perturbation size can stay fixed across datasets or must be scaled with model size or learning rate; see the sketch after this list.
  • The approach may interact with other common regularizers such as dropout or weight decay in ways the paper does not explore.
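
On the second bullet above, one hypothetical probe (not something the paper reports) is to make the perturbation relative to each weight tensor's norm, so that a single epsilon has some chance of transferring across backbone sizes. It swaps one line into the sketch shown earlier:

    # Hypothetical relative scaling: perturbation proportional to the weight
    # norm rather than a fixed absolute epsilon.
    d = epsilon * p.norm() * p.grad / (p.grad.norm() + 1e-12)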

Load-bearing premise

Generating perturbations to the weights in the opposite direction of the gradient will reduce overfitting without causing training instability or requiring dataset-specific hyper-parameter changes.

What would settle it

On a new held-out CIR benchmark, compare standard fine-tuning against WRF4CIR and measure whether the train-test recall gap shrinks and whether test recall at rank 10 or 50 rises; if neither improvement appears, the central claim is falsified.
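
A hedged sketch of that measurement, assuming L2-normalized composed-query and target-image embedding matrices (the names are illustrative, not the paper's code):

    import torch

    def recall_at_k(query_emb, gallery_emb, target_idx, k=10):
        # Cosine similarity of each composed query against the whole gallery.
        sims = query_emb @ gallery_emb.T
        topk = sims.topk(k, dim=1).indices
        # A query scores a hit if its ground-truth target appears in the top k.
        hits = (topk == target_idx.unsqueeze(1)).any(dim=1)
        return hits.float().mean().item()

    # gap = recall_at_k(train_q, gallery, train_t) - recall_at_k(test_q, gallery, test_t)
    # The claim fails if the gap does not shrink and test recall does not rise.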

Figures

Figures reproduced from arXiv: 2604.05583 by Chaojian Yu, Qinmu Peng, Tongliang Liu, Xinge You, Yizhuo Xu, Yuanjie Shao.

Figure 1: Visualization of learning curve and generalization gap across (a) different baseline methods on FashionIQ (LF: late …
Figure 2: Generalization gap and recall on FashionIQ under …
Figure 3: Illustration of our WRF4CIR framework. The reference image and modification text are first fused through the Q…
Figure 3 (continued): Specifically, WRF4CIR performs two forward-backward …
Figure 4: Case study between the baseline (w/o WRF) and our …
Figure 5: The ablation studies of WRF4CIR. (a) Effect of perturbation strength …
Figure 6: Evaluation of WRF4CIR under different CIR mechanisms, training data sizes, …
original abstract

Composed Image Retrieval (CIR) task aims to retrieve target images based on reference images and modification texts. Current CIR methods primarily rely on fine-tuning vision-language pre-trained models. However, we find that these approaches commonly suffer from severe overfitting, posing challenges for CIR with limited triplet data. To better understand this issue, we present a systematic study of overfitting in VLP-based CIR, revealing a significant and previously overlooked generalization gap across different models and datasets. Motivated by these findings, we introduce WRF4CIR, a Weight-Regularized Fine-tuning network for CIR. Specifically, during the fine-tuning process, we apply adversarial perturbations to the model weights for regularization, where these perturbations are generated in the opposite direction of gradient descent. Intuitively, WRF4CIR increases the difficulty of fitting the training data, which helps mitigate overfitting in CIR under limited triplet supervision. Extensive experiments on benchmark datasets demonstrate that WRF4CIR significantly narrows the generalization gap and achieves substantial improvements over existing methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper identifies severe overfitting in fine-tuning vision-language pre-trained (VLP) models for Composed Image Retrieval (CIR) under limited triplet supervision, documenting a significant generalization gap across models and datasets via a systematic study. It proposes WRF4CIR, which applies adversarial weight perturbations generated in the opposite direction of gradient descent during fine-tuning to increase fitting difficulty and act as regularization. Extensive experiments on benchmark datasets are claimed to show that WRF4CIR narrows the generalization gap and yields substantial improvements over existing CIR methods.

Significance. If the empirical results are robust, the work contributes a targeted regularization approach for mitigating overfitting in data-scarce VLP fine-tuning for multimodal retrieval tasks. The systematic study of the generalization gap provides useful diagnostic insight into current limitations. The method is defined independently of the evaluation benchmarks, which is a positive aspect for assessing its validity.

major comments (3)
  1. [§3] Method description: The adversarial perturbation is specified only at a high level as 'generated in the opposite direction of gradient descent.' No details are provided on the perturbation magnitude (epsilon), its schedule during training, which parameters or layers are perturbed, or safeguards against divergence. This bears directly on the weakest assumption: that the technique is stable without dataset-specific tuning.
  2. [§4] Experiments: The central claim of 'substantial improvements' and of a narrowed generalization gap lacks reported details on baseline implementations, exact metrics (e.g., Recall@K values), number of runs, error bars, or statistical significance tests. Without these, it is difficult to verify whether the gains are reliable or sensitive to hyperparameter choices.
  3. [§4.3] Ablation studies (likely §4.3 or the supplementary material): No experiments isolate the effect of the anti-gradient direction versus random perturbations, same-direction perturbations, or standard regularizers (e.g., weight decay). This is load-bearing for attributing the gains specifically to the proposed mechanism rather than to generic regularization.
minor comments (2)
  1. [Abstract] The phrase 'significant and previously overlooked generalization gap' is stated without quantitative values or references to specific tables or figures from the study section.
  2. [Throughout] Notation and terminology: Ensure all acronyms (VLP, CIR) are defined at first use and used consistently; clarify the exact loss function and optimizer setup used in the fine-tuning baseline.
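
On the second minor comment, an illustration of the kind of setup the referee wants spelled out: a common CIR fine-tuning choice (assumed here, not confirmed by the paper) is a batch contrastive loss between composed-query and target embeddings, optimized with AdamW.

    import torch
    import torch.nn.functional as F

    def batch_contrastive_loss(query_emb, target_emb, tau=0.07):
        # InfoNCE over the batch: the matching target is the positive,
        # every other target in the batch serves as a negative.
        q = F.normalize(query_emb, dim=1)
        t = F.normalize(target_emb, dim=1)
        logits = q @ t.T / tau
        labels = torch.arange(q.size(0), device=q.device)
        return F.cross_entropy(logits, labels)

    # optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)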

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important areas for improving clarity and rigor in our manuscript. We address each major comment point-by-point below. We agree that additional details and experiments are needed to strengthen the presentation and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [§3] Method description: The adversarial perturbation is specified only at a high level as 'generated in the opposite direction of gradient descent.' No details are provided on the perturbation magnitude (epsilon), its schedule during training, which parameters or layers are perturbed, or safeguards against divergence. This bears directly on the weakest assumption: that the technique is stable without dataset-specific tuning.

    Authors: We acknowledge that Section 3 presents the perturbation mechanism at a conceptual level. In the revised manuscript, we will expand the method description to include the specific perturbation magnitude (epsilon value and how it is chosen), the training schedule (e.g., whether it is applied every epoch or with a ramp-up), the parameters or layers affected (e.g., all model weights or selected subsets), and safeguards such as gradient clipping, loss monitoring, or early stopping to prevent divergence. These additions will ensure the technique is fully reproducible and address concerns about stability across datasets. revision: yes

  2. Referee: [§4] Experiments: The central claim of 'substantial improvements' and of a narrowed generalization gap lacks reported details on baseline implementations, exact metrics (e.g., Recall@K values), number of runs, error bars, or statistical significance tests. Without these, it is difficult to verify whether the gains are reliable or sensitive to hyperparameter choices.

    Authors: We agree that the experimental section requires more transparency to support the claims of improvements and gap narrowing. In the revision, we will report exact Recall@K (and other metrics) for all methods and datasets, detail how baselines were implemented or sourced (including any re-implementations), present results averaged over multiple independent runs with standard deviations or error bars, and include statistical significance tests (e.g., paired t-tests) where relevant. This will allow readers to assess reliability and sensitivity to hyperparameters. revision: yes

  3. Referee: [§4.3] Ablation studies (likely §4.3 or the supplementary material): No experiments isolate the effect of the anti-gradient direction versus random perturbations, same-direction perturbations, or standard regularizers (e.g., weight decay). This is load-bearing for attributing the gains specifically to the proposed mechanism rather than to generic regularization.

    Authors: We recognize that isolating the contribution of the anti-gradient direction is essential to validate the proposed mechanism. We will add dedicated ablation studies in the revised Section 4.3 (or supplementary material) that compare WRF4CIR against variants using random perturbations, same-direction perturbations, and standard weight decay regularization, while keeping other factors controlled. These experiments will quantify the specific benefit of the opposite-direction adversarial perturbation over generic regularization approaches. revision: yes
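
To make the promised ablation concrete, the three perturbation directions could come from a single helper; epsilon and the mode names here are illustrative, not drawn from the paper.

    import torch

    def make_delta(grad, epsilon, mode):
        unit = grad / (grad.norm() + 1e-12)
        if mode == "anti_gradient":    # WRF4CIR: opposite the descent step, loss rises
            return epsilon * unit
        if mode == "same_direction":   # along the descent step, loss falls
            return -epsilon * unit
        if mode == "random":           # random direction with the same norm
            r = torch.randn_like(grad)
            return epsilon * r / (r.norm() + 1e-12)
        raise ValueError(f"unknown mode: {mode}")

Holding epsilon fixed across the three modes, with a weight-decay-only run as a fourth arm, would isolate whether the anti-gradient direction itself carries the gains.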

Circularity Check

0 steps flagged

No significant circularity: method is an independent regularization proposal evaluated empirically

full rationale

The paper identifies overfitting in VLP-based CIR via systematic study, then proposes WRF4CIR as a weight-regularized fine-tuning approach that applies adversarial perturbations to weights in the direction opposite gradient descent. This is presented as a training-time regularization heuristic motivated by the observed generalization gap, not as a mathematical derivation or prediction that reduces to fitted parameters or self-referential definitions. No equations or claims in the provided text equate the method's output to its inputs by construction, nor do they rely on load-bearing self-citations for uniqueness or ansatz smuggling. The central improvements are demonstrated through benchmark experiments rather than tautological renaming or forced statistical outcomes. This is a standard empirical ML contribution with independent content.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach relies on standard deep learning assumptions about optimization and regularization effects. No new entities are postulated. A likely free parameter is the scale of perturbations, chosen to balance fitting difficulty.

free parameters (1)
  • perturbation magnitude or regularization strength
    Controls how strongly the adversarial weight changes are applied; must be set to achieve the claimed regularization benefit.
axioms (1)
  • domain assumption: Adversarial perturbations opposite to gradient descent increase the difficulty of fitting training data and thereby reduce overfitting in fine-tuning
    Central to the motivation and method description in the abstract.

pith-pipeline@v0.9.0 · 5493 in / 1332 out tokens · 80180 ms · 2026-05-10T19:28:00.451745+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

80 extracted references · 14 canonical work pages · 3 internal anchors
