Your Embedding Model is SMARTer Than You Think

Donghyun Kim; Hyun Jung Lee; Jianrui Zhang; Sukanta Ganguly; Tae-Eui Kam; Yong Jae Lee

arXiv: 2605.24938 · v1 · pith:QHO5O2TFnew · submitted 2026-05-24 · 💻 cs.IR · cs.AI · cs.CV

Jianrui Zhang, Hyun Jung Lee, Sukanta Ganguly, Tae-Eui Kam, Donghyun Kim, Yong Jae Lee

Pith reviewed 2026-06-29 23:54 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CV

keywords multimodal retrieval · single-vector models · late interaction · hidden states · contrastive training · plug-and-play upgrade · visual document retrieval

The pith

Single-vector embedding models already encode effective multi-vector retrieval in their frozen hidden states.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard contrastive training on a pooled embedding already organizes the geometry of earlier hidden states through gradient flow, making direct late interaction on those states effective at inference time. SMART applies this late interaction without any retraining or adaptation, turning existing single-vector models into stronger multi-vector retrievers across modalities. A sympathetic reader would care because the change requires no extra parameters or training yet raises performance on multimodal benchmarks and even on state-of-the-art models. The same principle also supports lightweight post-training that lets a single-vector model surpass dedicated multi-vector systems on visual document retrieval.

Core claim

Standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of preceding hidden states via gradient flow. By applying direct late-interaction over these frozen hidden states during inference, SMART acts as a plug-and-play upgrade that consistently improves performance across diverse modalities, improving even the state-of-the-art models further on MMEB-V2. Simple lightweight post-training further improves results on visual document retrieval, allowing a single-vector model to outperform SoTA multi-vector counterparts.

What carries the argument

Direct late-interaction over frozen hidden states whose geometry was shaped by gradient flow from pooled-embedding contrastive training.
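
For concreteness, the two scoring rules being contrasted can be written out. This is a sketch that assumes ColBERT-style MaxSim as the late-interaction operator and cosine similarity on the pooled embedding; the abstract does not spell out either choice, and the symbols $q_i$, $d_j$, $\bar{q}$, $\bar{d}$ are notation introduced here for illustration.

```latex
% Assumed notation (not from the paper): q_1..q_m and d_1..d_n are the
% L2-normalized token hidden states of the query Q and the candidate D;
% \bar{q} and \bar{d} are the corresponding pooled embeddings.
\begin{align}
s_{\text{single}}(Q, D) &= \frac{\bar{q}^{\top} \bar{d}}{\lVert \bar{q} \rVert \, \lVert \bar{d} \rVert}, \\
s_{\text{late}}(Q, D)   &= \sum_{i=1}^{m} \max_{1 \le j \le n} q_i^{\top} d_j .
\end{align}
```

Under this reading, the single-vector score compares one summary per item, while the late-interaction score lets each query token find its best-matching candidate token before summing.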

Load-bearing premise

Gradient flow from training the final pooled embedding already arranges the earlier hidden states into a geometry that supports effective late interaction without further adaptation.
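
One way to see why the premise is at least plausible is to write the gradient for the simplest pooling rule. The sketch below assumes mean pooling over the final-layer token states; the abstract does not state which pooling function the studied models use, and $h_i$, $z$, and $\mathcal{L}$ are notation introduced here.

```latex
% Assumption: mean pooling over final-layer token states h_1..h_n;
% z is the pooled embedding and \mathcal{L} the contrastive loss on z.
\begin{align}
z &= \frac{1}{n} \sum_{i=1}^{n} h_i, \\
\frac{\partial \mathcal{L}}{\partial h_i}
  &= \left( \frac{\partial z}{\partial h_i} \right)^{\!\top} \frac{\partial \mathcal{L}}{\partial z}
   = \frac{1}{n} \, \frac{\partial \mathcal{L}}{\partial z}
   \quad \text{for every } i .
\end{align}
```

Every token state therefore receives the pooled embedding's gradient, scaled by $1/n$. With last-token pooling the direct path touches only one position, and the other tokens are reached through attention in earlier layers. Neither case shows by itself that the resulting token geometry supports late interaction; that remains the empirical question.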

What would settle it

Applying late interaction to the hidden states of a trained single-vector model and measuring no gain or a performance drop on MMEB-V2 or similar retrieval benchmarks would falsify the central claim.
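
The test described above can be sketched in a few lines. The code below is a minimal illustration, not the authors' evaluation harness: it assumes token hidden states have already been extracted as NumPy arrays, uses mean pooling for the single-vector score and MaxSim for late interaction, and reports Recall@1. All function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length, guarding against zero norm."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def pooled_scores(queries, docs):
    """Cosine similarity between mean-pooled embeddings (mean pooling is an assumption).

    queries, docs: lists of (num_tokens, dim) arrays. Returns (num_queries, num_docs).
    """
    q = l2_normalize(np.stack([tokens.mean(axis=0) for tokens in queries]))
    d = l2_normalize(np.stack([tokens.mean(axis=0) for tokens in docs]))
    return q @ d.T

def late_interaction_scores(queries, docs):
    """MaxSim: for each query token take its best document token, then sum."""
    scores = np.zeros((len(queries), len(docs)))
    for a, q_tokens in enumerate(queries):
        q = l2_normalize(q_tokens)
        for b, d_tokens in enumerate(docs):
            d = l2_normalize(d_tokens)
            scores[a, b] = (q @ d.T).max(axis=1).sum()
    return scores

def recall_at_1(scores, gold):
    """Fraction of queries whose top-scoring document is the gold document."""
    return float(np.mean(scores.argmax(axis=1) == np.asarray(gold)))

def late_minus_pooled(queries, docs, gold):
    """Positive values favor late interaction; zero or negative would count against the claim."""
    late = recall_at_1(late_interaction_scores(queries, docs), gold)
    pooled = recall_at_1(pooled_scores(queries, docs), gold)
    return late - pooled
```

A hand-built case shows what the comparison is sensitive to: if one document contains a token that matches the query exactly while its mean points elsewhere, MaxSim ranks it first and the pooled score does not. On a real benchmark the sign of `late_minus_pooled` across models is what would bear on the claim.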

Figures

Figures reproduced from arXiv: 2605.24938 by Donghyun Kim, Hyun Jung Lee, Jianrui Zhang, Sukanta Ganguly, Tae-Eui Kam, Yong Jae Lee.

**Figure 2.** Controlled local-evidence toy benchmark. Each query specifies a local code–marker …

**Figure 3.** Qualitative visualization of SMART on image-to-image retrieval. Each row shows the …

**Figure 4.** Additional qualitative examples where the original single-vector retriever fails but SMART …

**Figure 5.** Token-level visualization of SMART. For selected query image tokens, we show the …

Original abstract

Multimodal retrieval relies heavily on single-vector retrievers, which compress rich, sequential token sequences into one single global representation. While efficient, they discard fine-grained, local evidence critical for dense retrieval tasks. Multi-vector approaches were introduced as a solution, but they strictly require training and many ignore the necessity of a globally summarizing representation. To address this, we introduce SMART, a framework that unlocks the latent multi-vector capabilities of standard single-vector models. We first demonstrate that standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of preceding hidden states via gradient flow. By applying direct late-interaction over these frozen hidden states during inference, SMART acts as a plug-and-play upgrade that consistently improves performance across diverse modalities, improving even the state-of-the-art models further on MMEB-V2. We also reveal SMART's superior performance, as simple lightweight post-training not only saves time and compute, but also brings forth further improvement on Visual Document retrieval, allowing a single-vector model to outperform SoTA multi-vector counterparts. Ultimately, SMART offers both a highly efficient inference enhancement and a powerful finetuning technique for multimodal retrieval. We open source our code and weights at https://github.com/HanSolo9682/SMART.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SMART shows that late interaction on frozen hidden states from contrastively trained single-vector models can improve retrieval, but the mechanism claim needs stronger evidence than performance numbers alone.

The main new piece here is the observation that you can take a standard single-vector multimodal retriever, skip any retraining, and run max-similarity late interaction directly on its token hidden states at inference time. The paper reports this gives consistent gains across modalities and even lifts some SOTA models on MMEB-V2. They also add a lightweight post-training step that lets a single-vector model beat multi-vector baselines on visual document retrieval. Opening the code and weights is useful for anyone who wants to test it quickly.

The soft spot is the central premise that contrastive training on the pooled embedding already shapes the preceding hidden states into a geometry that supports effective late interaction. The abstract frames this as an implicit consequence of gradient flow, but the stress-test note is right that we do not yet see a derivation, gradient inspection, or ablation that isolates the pooled loss effect on token representations versus the final vector. Without that, the plug-and-play gains could still be real but tied to the specific models or datasets rather than a general property.

This deserves a serious referee from the IR crowd. The idea is practical if the experiments hold up with proper controls, and the open release lowers the barrier to checking it. I would send it out rather than desk reject, mainly to get clarity on the mechanism and the ablation depth.

Referee Report

2 major / 2 minor

Summary. The paper introduces SMART, a framework claiming that standard contrastive training on the pooled embedding of single-vector multimodal models implicitly shapes the retrieval geometry of preceding hidden states via gradient flow; applying direct late-interaction over these frozen states at inference yields consistent performance gains across modalities (including on SOTA models on MMEB-V2) as a plug-and-play upgrade, with additional gains from lightweight post-training that can outperform multi-vector SOTA on visual document retrieval.

Significance. If the empirical results and the implicit-geometry premise hold under rigorous controls, the work would provide a practical, low-overhead bridge between single-vector efficiency and multi-vector expressivity in multimodal IR, reducing the need for specialized multi-vector training while enabling both inference-time enhancement and efficient finetuning.

major comments (2)

[Abstract] Abstract and the demonstration section: the central premise that 'standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of preceding hidden states via gradient flow' is load-bearing for the plug-and-play claim, yet the manuscript supplies no gradient analysis, token-level similarity ablations, or controls that isolate the pooled contrastive objective's effect on individual hidden states versus the pooled vector.
[Experimental results] Experimental results on MMEB-V2 and visual document retrieval: without reported ablations that disable the contrastive loss during pre-training (or compare against randomly initialized hidden states) while measuring late-interaction gains, it remains unclear whether the observed improvements are a general consequence of contrastive training or an artifact of the specific models and datasets tested.

minor comments (2)

[Abstract] The abstract and introduction would benefit from explicit definitions of the late-interaction operator (e.g., max-similarity over token pairs) and the precise pooling function used during training.
Figure captions and tables should include error bars or statistical significance tests for the reported improvements over baselines.
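
One common way to attach uncertainty to a reported improvement is a paired bootstrap over per-query scores. The sketch below is an illustration of that generic technique, not something the paper reports doing; the function name, its parameters, and the choice of a percentile interval are all assumptions made here.

```python
import numpy as np

def paired_bootstrap_ci(baseline, treated, num_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-query improvement.

    baseline, treated: 1-D arrays of a per-query metric (for example 0/1 hits at
    rank 1) for the same queries in the same order.
    Returns (mean_difference, lower_bound, upper_bound).
    """
    baseline = np.asarray(baseline, dtype=float)
    treated = np.asarray(treated, dtype=float)
    if baseline.ndim != 1 or baseline.shape != treated.shape:
        raise ValueError("baseline and treated must be 1-D arrays of equal length")

    diffs = treated - baseline
    rng = np.random.default_rng(seed)
    # Resample query indices with replacement; pairing is kept because we
    # resample the per-query differences rather than each system separately.
    idx = rng.integers(0, diffs.size, size=(num_resamples, diffs.size))
    resampled_means = diffs[idx].mean(axis=1)
    lower, upper = np.quantile(resampled_means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(lower), float(upper)
```

An interval that excludes zero suggests the gain is unlikely to be a resampling artifact on that query set. The index matrix has `num_resamples × num_queries` entries, so very large benchmarks may need the resampling done in chunks.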

Simulated Author's Rebuttal

2 responses · 1 unresolved

We are grateful to the referee for the constructive feedback on strengthening the evidence for our central premise. We address each major comment below and indicate planned revisions.

Referee: [Abstract] Abstract and the demonstration section: the central premise that 'standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of preceding hidden states via gradient flow' is load-bearing for the plug-and-play claim, yet the manuscript supplies no gradient analysis, token-level similarity ablations, or controls that isolate the pooled contrastive objective's effect on individual hidden states versus the pooled vector.

Authors: We acknowledge that the manuscript relies on empirical demonstration rather than explicit gradient analysis or token-level ablations. The consistent gains from applying late interaction to frozen hidden states of contrastively trained models (including SOTA models on MMEB-V2) serve as evidence that the pooled objective has shaped preceding states via gradient flow; such gains would be unlikely otherwise. We will add token-level similarity ablations comparing hidden-state geometry in the revised manuscript.

revision: yes

Referee: [Experimental results] Experimental results on MMEB-V2 and visual document retrieval: without reported ablations that disable the contrastive loss during pre-training (or compare against randomly initialized hidden states) while measuring late-interaction gains, it remains unclear whether the observed improvements are a general consequence of contrastive training or an artifact of the specific models and datasets tested.

Authors: We agree such controls would be ideal. However, they require retraining large multimodal models without contrastive loss, which is computationally prohibitive. Our results instead demonstrate gains across multiple independently trained contrastive models and tasks. We will add a limitations discussion clarifying the scope of the empirical evidence.

revision: partial

standing simulated objections not resolved

Ablations that disable the contrastive loss during pre-training, or that use randomly initialized hidden states, remain outstanding; the simulated authors decline them because of the prohibitive computational cost of retraining.

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on observation rather than self-referential derivation

The paper's central assertion—that standard contrastive training on pooled embeddings implicitly shapes preceding hidden-state geometry via gradient flow—is presented as an empirical demonstration rather than a mathematical derivation. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. The SMART method is framed as a plug-and-play inference technique whose gains are measured externally on MMEB-V2 and other benchmarks, without reducing any result to its own inputs by construction. This is the common case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that gradient flow during pooled contrastive training shapes hidden-state geometry sufficiently for effective late interaction; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of preceding hidden states via gradient flow.
This premise is stated directly in the abstract as the foundation for applying late interaction to frozen states.
