pith. machine review for the scientific record. sign in

arxiv: 2604.14710 · v1 · submitted 2026-04-16 · 💻 cs.CV

Recognition: unknown

G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval

Authors on Pith no claims yet

Pith reviewed 2026-05-10 11:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot composed image retrievalgeodesic mixupimplicit semantic expansionexplicit semantic re-rankingmultimodal large language modelstraining-free methodimage retrievalsemantic composition
0
0 comments X

The pith

G-MIXER uses geodesic mixup across ratios to expand implicit semantics then re-ranks with MLLM explicit descriptions for zero-shot composed image retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a training-free method called G-MIXER that addresses limitations in zero-shot composed image retrieval. Current approaches convert reference image and modification text into explicit text via MLLMs but overlook the fuzzy, multi-way nature of how the modification should combine with the image. G-MIXER instead performs geodesic mixup on the reference image-text pair over a continuous range of ratios to generate a diverse set of composed query features that capture implicit semantics. It then applies explicit semantics from MLLM-generated descriptions to re-rank those candidates. This dual handling of implicit expansion and explicit refinement is presented as the route to higher diversity and accuracy on ZS-CIR benchmarks without any model training.

Core claim

G-MIXER constructs composed query features that reflect the implicit semantics of reference image-text pairs through geodesic mixup over a range of mixup ratios, and builds a diverse candidate set. The generated candidates are then re-ranked using explicit semantics derived from MLLMs, improving both retrieval diversity and accuracy. The method achieves state-of-the-art performance across multiple ZS-CIR benchmarks while remaining training-free.

What carries the argument

Geodesic mixup over a range of mixup ratios for implicit semantic expansion, followed by MLLM-derived explicit semantic re-ranking.

If this is right

  • Achieves state-of-the-art performance on multiple ZS-CIR benchmarks
  • Handles both implicit and explicit semantics in a single pipeline
  • Requires no additional training or fine-tuning
  • Produces more diverse candidate sets than text-only MLLM approaches
  • Improves accuracy by refining the fuzzy composition space

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The ratio-sampling strategy implies that the space of valid image modifications is better approximated by a geodesic path than by a single interpolated point.
  • The separation of implicit feature mixing from explicit re-ranking suggests the same split could be tested in other zero-shot retrieval settings where part of the query is visual and part is textual.
  • If the geodesic assumption holds, similar mixup constructions might reduce dependence on MLLM quality for the implicit component across related multimodal tasks.

Load-bearing premise

Geodesic mixup over a range of ratios accurately constructs composed query features reflecting the implicit semantics of reference image-text pairs, and MLLM-generated explicit descriptions reliably improve re-ranking for diversity and accuracy.

What would settle it

Evaluating the full G-MIXER pipeline on standard ZS-CIR benchmarks and observing no gains in recall or precision metrics and no measurable increase in result diversity relative to prior training-free baselines would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.14710 by Heejae Yang, Jee-Hyong Lee, Jiyoung Lim.

Figure 1
Figure 1. Figure 1: Illustration of our motivation. The query contains implicit information cat, basket present in the image but not mentioned in the text, while the modification text includes explicit attributes such as bench and vegetables. The MLLM based caption generation approach narrows the retrieval scope by converting implicit information from reference image into explicit descriptions. In contrast, the composed query… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of G-MIXER. (1) Generate the Target Image Description and Re-ranking Attributes using MLLM; (2) Perform extended Geodesic Mixup on the features obtained by encoding the Target Image Description and Reference Image with the pretrained CLIP encoder; (3) Filter noisy samples through Explicit Semantic Re-ranking. sulting pseudo word tokens to model the compositional re￾lationship between images and te… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative results on CIRCO. As shown in [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative results on Fashion-IQ dress. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparison between the proposed range-based geodesic mixup (orange) and the fixed-ratio mixup (gray). The range-based approach expands the retrieval space by applying a range of mixup ratios, while the fixed-ratio method uses a single static value. 4.5.2. Effect of ER Explicit Semantic Re-ranking refines the first-stage candi￾date set by removing noisy candidate based on explicit in￾formation. The results … view at source ↗
read the original abstract

Composed Image Retrieval (CIR) aims to retrieve target images by integrating a reference image with a corresponding modification text. CIR requires jointly considering the explicit semantics specified in the query and the implicit semantics embedded within its bi-modal composition. Recent training-free Zero-Shot CIR (ZS-CIR) methods leverage Multimodal Large Language Models (MLLMs) to generate detailed target descriptions, converting the implicit information into explicit textual expressions. However, these methods rely heavily on the textual modality and fail to capture the fuzzy retrieval nature that requires considering diverse combinations of candidates. This leads to reduced diversity and accuracy in retrieval results. To address this limitation, we propose a novel training-free method, Geodesic Mixup-based Implicit semantic eXpansion and Explicit semantic Re-ranking for ZS-CIR (G-MIXER). G-MIXER constructs composed query features that reflect the implicit semantics of reference image-text pairs through geodesic mixup over a range of mixup ratios, and builds a diverse candidate set. The generated candidates are then re-ranked using explicit semantics derived from MLLMs, improving both retrieval diversity and accuracy. Our proposed G-MIXER achieves state-of-the-art performance across multiple ZS-CIR benchmarks, effectively handling both implicit and explicit semantics without additional training. Our code will be available at https://github.com/maya0395/gmixer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes G-MIXER, a training-free method for Zero-Shot Composed Image Retrieval (ZS-CIR). It constructs composed query features via geodesic mixup over a range of ratios in the joint embedding space to expand implicit semantics from reference image-modification text pairs, builds a diverse candidate set from these expanded features, and re-ranks the candidates using explicit semantic descriptions generated by Multimodal Large Language Models (MLLMs). The paper claims this dual handling of implicit and explicit semantics yields state-of-the-art performance on multiple ZS-CIR benchmarks without any additional training.

Significance. If the central claims hold, G-MIXER would advance training-free ZS-CIR by mitigating over-reliance on textual modality and improving retrieval diversity, which prior MLLM-based methods reportedly lack. The approach's use of geodesic paths for semantic expansion and subsequent explicit re-ranking could provide a practical, parameter-free way to model fuzzy compositions. Credit is due for the training-free design and the explicit plan to release code, which supports reproducibility.

major comments (2)
  1. [§3] §3: The geodesic mixup procedure for implicit semantic expansion is presented as the core mechanism, yet no validation is given that the computed geodesics in the pre-trained embedding manifold differ meaningfully from Euclidean linear interpolation or that they preserve/expand semantics as claimed. No manifold curvature analysis, geodesic computation details (e.g., how the metric is defined), or direct ablation against linear mixup baselines appears, which is load-bearing for the implicit-expansion claim.
  2. [§4] §4 (Experiments): The SOTA performance claims across ZS-CIR benchmarks are asserted, but the manuscript provides insufficient detail on implementation (e.g., exact mixup ratio sampling, MLLM prompting, candidate set size), ablation studies isolating the geodesic component, or error analysis to confirm that gains are attributable to the proposed method rather than implementation choices or benchmark specifics.
minor comments (2)
  1. [Abstract / §1] The term 'fuzzy retrieval nature' is used in the abstract and introduction without a precise definition or reference to prior literature on fuzzy matching in retrieval.
  2. [Figures / §3] Figure captions and algorithm pseudocode should explicitly state the embedding model and distance metric used for geodesic computation to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas where additional clarification and evidence can strengthen the presentation of G-MIXER. We address each major comment point by point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3] §3: The geodesic mixup procedure for implicit semantic expansion is presented as the core mechanism, yet no validation is given that the computed geodesics in the pre-trained embedding manifold differ meaningfully from Euclidean linear interpolation or that they preserve/expand semantics as claimed. No manifold curvature analysis, geodesic computation details (e.g., how the metric is defined), or direct ablation against linear mixup baselines appears, which is load-bearing for the implicit-expansion claim.

    Authors: We agree that the manuscript would benefit from explicit validation of the geodesic mixup component. In the revision, we will add details on geodesic computation in the joint embedding space, an empirical comparison (with qualitative examples) demonstrating differences from Euclidean linear interpolation, an ablation study replacing geodesic mixup with linear mixup to quantify gains, and a brief discussion of manifold properties in pre-trained image-text embeddings that motivates the geodesic approach. These additions will directly support the implicit semantic expansion claim. revision: yes

  2. Referee: [§4] §4 (Experiments): The SOTA performance claims across ZS-CIR benchmarks are asserted, but the manuscript provides insufficient detail on implementation (e.g., exact mixup ratio sampling, MLLM prompting, candidate set size), ablation studies isolating the geodesic component, or error analysis to confirm that gains are attributable to the proposed method rather than implementation choices or benchmark specifics.

    Authors: We acknowledge that greater experimental transparency is needed. The revised manuscript will include full implementation details (mixup ratio sampling strategy, MLLM prompting templates, and candidate set size), expanded ablations that isolate the geodesic mixup contribution, and an error analysis section with representative success and failure cases. These changes will help attribute performance improvements to the dual implicit-explicit design. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is a novel algorithmic construction evaluated externally

full rationale

The paper presents G-MIXER as a training-free pipeline that constructs composed query features via geodesic mixup over mixup ratios and re-ranks candidates using MLLM-generated explicit descriptions. No equations or claims reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the central steps are presented as new combinations of existing components (pre-trained embeddings, MLLMs) and are validated on external ZS-CIR benchmarks rather than internal consistency. The derivation chain is self-contained as an empirical proposal without tautological reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; full paper required for ledger assessment.

pith-pipeline@v0.9.0 · 5562 in / 1062 out tokens · 37489 ms · 2026-05-10T11:47:13.038330+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 9 canonical work pages · 4 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

  2. [2]

    isearle: Improving textual inversion for zero-shot composed image retrieval.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Lorenzo Agnolucci, Alberto Baldrati, Alberto Del Bimbo, and Marco Bertini. isearle: Improving textual inversion for zero-shot composed image retrieval.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 2

  3. [3]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

  4. [4]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 1(2):3,

  5. [5]

    Effective conditioned and composed im- age retrieval combining clip-based features

    Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Al- berto Del Bimbo. Effective conditioned and composed im- age retrieval combining clip-based features. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21466–21474, 2022. 2

  6. [6]

    Zero-shot composed image retrieval with textual inversion

    Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, and Al- berto Del Bimbo. Zero-shot composed image retrieval with textual inversion. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 15338–15347,

  7. [7]

    A fuzzy linguistic ap- proach generalizing boolean information retrieval: A model and its evaluation.Journal of the American Society for In- formation Science, 44(2):70–82, 1993

    Gloria Bordogna and Gabriella Pasi. A fuzzy linguistic ap- proach generalizing boolean information retrieval: A model and its evaluation.Journal of the American Society for In- formation Science, 44(2):70–82, 1993. 1

  8. [8]

    A region-based fuzzy feature matching approach to content-based image retrieval

    Yixin Chen and James Ze Wang. A region-based fuzzy feature matching approach to content-based image retrieval. IEEE transactions on pattern analysis and machine intelli- gence, 24(9):1252–1267, 2002. 1

  9. [9]

    Instructblip: Towards general-purpose vision- language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision- language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023. 3

  10. [10]

    Image2sentence based asymmetrical zero-shot composed image retrieval.arXiv preprint arXiv:2403.01431,

    Yongchao Du, Min Wang, Wengang Zhou, Shuping Hui, and Houqiang Li. Image2sentence based asymmetrical zero-shot composed image retrieval.arXiv preprint arXiv:2403.01431,

  11. [11]

    Language-only efficient training of zero- shot composed image retrieval–appendix–

    Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, and Sangdoo Yun. Language-only efficient training of zero- shot composed image retrieval–appendix–. 2

  12. [12]

    Egocvr: An egocentric benchmark for fine-grained composed video retrieval

    Thomas Hummel, Shyamgopal Karthik, Mariana-Iuliana Georgescu, and Zeynep Akata. Egocvr: An egocentric benchmark for fine-grained composed video retrieval. In European Conference on Computer Vision, pages 1–17. Springer, 2024. 3

  13. [13]

    Spherical linear interpolation and text- anchoring for zero-shot composed image retrieval

    Young Kyun Jang, Dat Huynh, Ashish Shah, Wen-Kai Chen, and Ser-Nam Lim. Spherical linear interpolation and text- anchoring for zero-shot composed image retrieval. InEu- ropean Conference on Computer Vision, pages 239–254. Springer, 2024. 2

  14. [14]

    Scaling up visual and vision-language representa- tion learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR,

  15. [15]

    From clip to dino: Visual encoders shout in multi-modal large language models.arXiv preprint arXiv:2310.08825, 2023

    Dongsheng Jiang, Yuchen Liu, Songlin Liu, Jin’e Zhao, Hao Zhang, Zhen Gao, Xiaopeng Zhang, Jin Li, and Hongkai Xiong. From clip to dino: Visual encoders shout in multi-modal large language models.arXiv preprint arXiv:2310.08825, 2023. 3

  16. [16]

    Vision-by-language for training- free compositional image retrieval.arXiv preprint arXiv:2310.09291, 2023

    Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, and Zeynep Akata. Vision-by-language for training- free compositional image retrieval.arXiv preprint arXiv:2310.09291, 2023. 2, 3, 6

  17. [17]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900. PMLR, 2022. 1, 3

  18. [18]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 3

  19. [19]

    Fine-grained textual inversion network for zero-shot composed image retrieval

    Haoqiang Lin, Haokun Wen, Xuemeng Song, Meng Liu, Yu- peng Hu, and Liqiang Nie. Fine-grained textual inversion network for zero-shot composed image retrieval. InProceed- ings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 240–250, 2024. 2

  20. [20]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 3

  21. [21]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024. 3

  22. [22]

    Image retrieval on real-life images with pre-trained vision-and-language models

    Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. InProceedings of the IEEE/CVF international conference on computer vision, pages 2125–2134, 2021. 2, 5

  23. [23]

    Pytorch: An im- perative style, high-performance deep learning library.Ad- vances in neural information processing systems, 32, 2019

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An im- perative style, high-performance deep learning library.Ad- vances in neural information processing systems, 32, 2019. 6

  24. [24]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Ground- ing multimodal large language models to the world.arXiv preprint arXiv:2306.14824, 2023. 3

  25. [25]

    Instance-level composed image retrieval

    Bill Psomas, George Retsinas, Nikos Efthymiadis, Panagio- tis Filntisis, Yannis Avrithis, Petros Maragos, Ondrej Chum, and Giorgos Tolias. Instance-level composed image retrieval. InThe Thirty-ninth Annual Conference on Neural Informa- tion Processing Systems, 2025. 2

  26. [26]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1, 3

  27. [27]

    Pic2word: Mapping pictures to words for zero-shot composed image retrieval

    Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, and Tomas Pfister. Pic2word: Mapping pictures to words for zero-shot composed image retrieval. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19305– 19314, 2023. 1, 2

  28. [28]

    Knowledge-enhanced dual-stream zero-shot composed im- age retrieval

    Yucheng Suo, Fan Ma, Linchao Zhu, and Yi Yang. Knowledge-enhanced dual-stream zero-shot composed im- age retrieval. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26951– 26962, 2024. 2

  29. [29]

    A fuzzy model of document retrieval sys- tems.Information Processing & Management, 12(3):177– 187, 1976

    Valiollah Tahani. A fuzzy model of document retrieval sys- tems.Information Processing & Management, 12(3):177– 187, 1976. 1

  30. [30]

    Context-i2w: Mapping images to context-dependent words for accurate zero-shot composed image retrieval

    Yuanmin Tang, Jing Yu, Keke Gai, Jiamin Zhuang, Gang Xiong, Yue Hu, and Qi Wu. Context-i2w: Mapping images to context-dependent words for accurate zero-shot composed image retrieval. InProceedings of the AAAI Conference on Artificial Intelligence, pages 5180–5188, 2024. 1

  31. [31]

    Missing target- relevant information prediction with world model for ac- curate zero-shot composed image retrieval.arXiv preprint arXiv:2503.17109, 2025

    Yuanmin Tang, Jing Yu, Keke Gai, Jiamin Zhuang, Gang Xiong, Gaopeng Gou, and Qi Wu. Missing target- relevant information prediction with world model for ac- curate zero-shot composed image retrieval.arXiv preprint arXiv:2503.17109, 2025. 6

  32. [32]

    Reason-before-retrieve: One-stage reflective chain-of-thoughts for training-free zero-shot com- posed image retrieval

    Yuanmin Tang, Jue Zhang, Xiaoting Qin, Jing Yu, Gaopeng Gou, Gang Xiong, Qingwei Lin, Saravan Rajmohan, Dong- mei Zhang, and Qi Wu. Reason-before-retrieve: One-stage reflective chain-of-thoughts for training-free zero-shot com- posed image retrieval. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14400– 14410, 2025. 2, 3, 6

  33. [33]

    Genecis: A benchmark for general conditional image similarity

    Sagar Vaze, Nicolas Carion, and Ishan Misra. Genecis: A benchmark for general conditional image similarity. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6862–6872, 2023. 5

  34. [34]

    Composing text and image for image retrieval-an empirical odyssey

    Nam V o, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. Composing text and image for image retrieval-an empirical odyssey. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6439–6448, 2019. 1, 2

  35. [35]

    Fashion iq: A new dataset towards retrieving images by natural language feedback

    Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion iq: A new dataset towards retrieving images by natural language feedback. InProceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 11307– 11317, 2021. 5

  36. [36]

    Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Zhe Chen, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, et al. Visionllm v2: An end-to-end general- ist multimodal large language model for hundreds of vision- language tasks.Advances in Neural Information Processing Systems, 37:69925–69975, 2024. 3

  37. [37]

    Ldre: Llm-based divergent reasoning and ensemble for zero-shot composed image re- trieval

    Zhenyu Yang, Dizhan Xue, Shengsheng Qian, Weiming Dong, and Changsheng Xu. Ldre: Llm-based divergent reasoning and ensemble for zero-shot composed image re- trieval. InProceedings of the 47th International ACM SIGIR conference on research and development in information re- trieval, pages 80–90, 2024. 1, 2, 3, 6

  38. [38]

    Filip: Fine-grained interactive language-image pre-training

    Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training.arXiv preprint arXiv:2111.07783, 2021. 3

  39. [39]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023. 3