Your Embedding Model is SMARTer Than You Think
Pith reviewed 2026-06-29 23:54 UTC · model grok-4.3
The pith
Single-vector embedding models already encode effective multi-vector retrieval in their frozen hidden states.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of preceding hidden states via gradient flow. By applying direct late-interaction over these frozen hidden states during inference, SMART acts as a plug-and-play upgrade that consistently improves performance across diverse modalities, improving even the state-of-the-art models further on MMEB-V2. Simple lightweight post-training further improves results on visual document retrieval, allowing a single-vector model to outperform SoTA multi-vector counterparts.
What carries the argument
Direct late-interaction over frozen hidden states whose geometry was shaped by gradient flow from pooled-embedding contrastive training.
Load-bearing premise
Gradient flow from training the final pooled embedding already arranges the earlier hidden states into a geometry that supports effective late interaction without further adaptation.
What would settle it
Applying late interaction to the hidden states of a trained single-vector model and measuring no gain or a performance drop on MMEB-V2 or similar retrieval benchmarks would falsify the central claim.
Figures
read the original abstract
Multimodal retrieval relies heavily on single-vector retrievers, which compress rich, sequential token sequences into one single global representation. While efficient, they discard fine-grained, local evidence critical for dense retrieval tasks. Multi-vector approaches were introduced as a solution, but they strictly require training and many ignore the necessity of a globally summarizing representation. To address this, we introduce SMART, a framework that unlocks the latent multi-vector capabilities of standard single-vector models. We first demonstrate that standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of preceding hidden states via gradient flow. By applying direct late-interaction over these frozen hidden states during inference, SMART acts as a plug-and-play upgrade that consistently improves performance across diverse modalities, improving even the state-of-the-art models further on MMEB-V2. We also reveal SMART's superior performance, as simple lightweight post-training not only saves time and compute, but also brings forth further improvement on Visual Document retrieval, allowing a single-vector model to outperform SoTA multi-vector counterparts. Ultimately, SMART offers both a highly efficient inference enhancement and a powerful finetuning technique for multimodal retrieval. We open source our code and weights at https://github.com/HanSolo9682/SMART.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SMART, a framework claiming that standard contrastive training on the pooled embedding of single-vector multimodal models implicitly shapes the retrieval geometry of preceding hidden states via gradient flow; applying direct late-interaction over these frozen states at inference yields consistent performance gains across modalities (including on SOTA models on MMEB-V2) as a plug-and-play upgrade, with additional gains from lightweight post-training that can outperform multi-vector SOTA on visual document retrieval.
Significance. If the empirical results and the implicit-geometry premise hold under rigorous controls, the work would provide a practical, low-overhead bridge between single-vector efficiency and multi-vector expressivity in multimodal IR, reducing the need for specialized multi-vector training while enabling both inference-time enhancement and efficient finetuning.
major comments (2)
- [Abstract] Abstract and the demonstration section: the central premise that 'standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of preceding hidden states via gradient flow' is load-bearing for the plug-and-play claim, yet the manuscript supplies no gradient analysis, token-level similarity ablations, or controls that isolate the pooled contrastive objective's effect on individual hidden states versus the pooled vector.
- [Experimental results] Experimental results on MMEB-V2 and visual document retrieval: without reported ablations that disable the contrastive loss during pre-training (or compare against randomly initialized hidden states) while measuring late-interaction gains, it remains unclear whether the observed improvements are a general consequence of contrastive training or an artifact of the specific models and datasets tested.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from explicit definitions of the late-interaction operator (e.g., max-similarity over token pairs) and the precise pooling function used during training.
- Figure captions and tables should include error bars or statistical significance tests for the reported improvements over baselines.
Simulated Author's Rebuttal
We are grateful to the referee for the constructive feedback on strengthening the evidence for our central premise. We address each major comment below and indicate planned revisions.
read point-by-point responses
-
Referee: [Abstract] Abstract and the demonstration section: the central premise that 'standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of preceding hidden states via gradient flow' is load-bearing for the plug-and-play claim, yet the manuscript supplies no gradient analysis, token-level similarity ablations, or controls that isolate the pooled contrastive objective's effect on individual hidden states versus the pooled vector.
Authors: We acknowledge that the manuscript relies on empirical demonstration rather than explicit gradient analysis or token-level ablations. The consistent gains from applying late interaction to frozen hidden states of contrastively trained models (including SOTA models on MMEB-V2) serve as evidence that the pooled objective has shaped preceding states via gradient flow; such gains would be unlikely otherwise. We will add token-level similarity ablations comparing hidden-state geometry in the revised manuscript. revision: yes
-
Referee: [Experimental results] Experimental results on MMEB-V2 and visual document retrieval: without reported ablations that disable the contrastive loss during pre-training (or compare against randomly initialized hidden states) while measuring late-interaction gains, it remains unclear whether the observed improvements are a general consequence of contrastive training or an artifact of the specific models and datasets tested.
Authors: We agree such controls would be ideal. However, they require retraining large multimodal models without contrastive loss, which is computationally prohibitive. Our results instead demonstrate gains across multiple independently trained contrastive models and tasks. We will add a limitations discussion clarifying the scope of the empirical evidence. revision: partial
- Ablations disabling contrastive loss during pre-training or using randomly initialized hidden states, due to prohibitive computational cost of retraining.
Circularity Check
No significant circularity; empirical claims rest on observation rather than self-referential derivation
full rationale
The paper's central assertion—that standard contrastive training on pooled embeddings implicitly shapes preceding hidden-state geometry via gradient flow—is presented as an empirical demonstration rather than a mathematical derivation. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. The SMART method is framed as a plug-and-play inference technique whose gains are measured externally on MMEB-V2 and other benchmarks, without reducing any result to its own inputs by construction. This is the common case of a self-contained empirical contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of preceding hidden states via gradient flow.
Reference graph
Works this paper leans on
-
[1]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Colpali: Efficient document retrieval with vision language models
Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, CELINE HUDELOT, and Pierre Colombo. Colpali: Efficient document retrieval with vision language models. InInternational Conference on Learning Representations, pages 61424–61449, 2025
2025
-
[4]
Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, and Han Xiao. jina-embeddings-v4: Universal embeddings for multimodal multilingual retrieval.https://arxiv.org/abs/2506.18902, 2025
-
[5]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models.https://arxiv.org/abs/2106.09685, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[6]
E5-V: Universal Embeddings with Multimodal Large Language Models
Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, De- qing Wang, and Fuzhen Zhuang. E5-v: Universal embeddings with multimodal large language models. https://arxiv.org/abs/2407.12580, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. Vlm2vec: Training vision-language models for massive multimodal embedding tasks.https://arxiv.org/abs/2410.05160, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
Omar Khattab and Matei Zaharia. Colbert: Efficient and effective passage search via contextualized late interaction over bert.https://arxiv.org/abs/2004.12832, 2020
-
[9]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation.https://arxiv.org/abs/2201.12086, 2022
-
[10]
Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking.https://arxiv.org/abs/2601.04720, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[11]
Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InarXiv. arXiv:2310.03744, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
Visual instruction tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, 2023
2023
-
[13]
Yikun Liu, Pingan Chen, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, and Weidi Xie. Lamra: Large multimodal model as your advanced retrieval assistant.https://arxiv.org/abs/2412.01720, 2024
-
[14]
Sparse, dense, and attentional representations for text retrieval.Transactions of the Association for Computational Linguistics, 9: 329–345, 2021
Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. Sparse, dense, and attentional representations for text retrieval.Transactions of the Association for Computational Linguistics, 9: 329–345, 2021
2021
-
[15]
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, and Semih Yavuz. Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents.https://arxiv.org/abs/2507.04590, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Representation Learning with Contrastive Predictive Coding
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[17]
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision.https://arxiv.org/abs/2103.00020, 2021. 11
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[18]
The curse of dense low-dimensional information retrieval for large index sizes
Nils Reimers and Iryna Gurevych. The curse of dense low-dimensional information retrieval for large index sizes. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 605–611, 2021
2021
-
[19]
Colbertv2: Effective and efficient retrieval via lightweight late interaction
Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. Colbertv2: Effective and efficient retrieval via lightweight late interaction. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3715–3734, 2022
2022
-
[20]
Attention is all you need.Advances in neural information processing systems, 30, 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017
2017
-
[21]
Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. Uniir: Training and benchmarking universal multimodal information retrievers.https://arxiv.org/abs/2311.17136, 2023
-
[22]
On the theoretical limitations of embedding-based retrieval.https://arxiv.org/abs/2508.21038, 2026
Orion Weller, Michael Boratko, Iftekhar Naim, and Jinhyuk Lee. On the theoretical limitations of embedding-based retrieval.https://arxiv.org/abs/2508.21038, 2026
-
[23]
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
Zilin Xiao, Qi Ma, Mengting Gu, Chun cheng Jason Chen, Xintao Chen, Vicente Ordonez, and Vijai Mohan. Metaembed: Scaling multimodal retrieval at test-time with flexible late interaction. https://arxiv.org/abs/2509.18095, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[24]
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, et al. Visrag: Vision-based retrieval-augmented generation on multi-modality documents. arXiv preprint arXiv:2410.10594, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
Sigmoid Loss for Language Image Pre-Training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training.https://arxiv.org/abs/2303.15343, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[26]
Reasoning-augmented representations for multimodal retrieval.https://arxiv.org/abs/2602.07125, 2026
Jianrui Zhang, Anirudh Sundara Rajan, Brandon Han, Soochahn Lee, Sukanta Ganguly, and Yong Jae Lee. Reasoning-augmented representations for multimodal retrieval.https://arxiv.org/abs/2602.07125, 2026
-
[27]
GME: Improving Universal Multimodal Retrieval by Multimodal LLMs
Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. Gme: Improving universal multimodal retrieval by multimodal llms. https://arxiv.org/abs/2412.16855, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
find the report where code x labels the red star marker
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. InarXiv, 2023. A Toy Dataset The pooling bottleneck is difficult to isolate in natural retrieval benchmarks, where global semantics, local evidence, and dataset biases are often entangled. We therefor...
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.