pith. machine review for the scientific record.

arxiv: 2410.05160 · v3 · pith:K7HXT4ZN · submitted 2024-10-07 · 💻 cs.CV · cs.AI · cs.CL

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Pith reviewed 2026-05-17 21:14 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords multimodal embeddings · vision-language models · contrastive learning · MMEB benchmark · universal embeddings · image-text retrieval · visual grounding

The pith

A contrastive training method turns vision-language models into versatile multimodal embedding models that beat prior multimodal embedders by 10 to 20 percentage points on average on a new benchmark of 36 datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that contrastive training on a large collection of multimodal datasets can convert existing vision-language models into universal embedding generators. These models accept arbitrary combinations of images and text along with task instructions and output fixed-length vectors suitable for classification, retrieval, visual question answering, and grounding. A sympathetic reader would care because multimodal embedding progress has lagged behind text-only models, and a single training recipe applied to strong VLMs like LLaVA and Phi-3.5-V yields consistent gains on both seen and unseen tasks. The result suggests that the heavy lifting of building general-purpose multimodal embedders can be offloaded to already-trained VLMs rather than designing new architectures from scratch.
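
One way to picture how a variable-length sequence of image and text tokens becomes a fixed-length vector is last-token pooling followed by L2 normalization. The sketch below is a minimal illustration under that assumption. The summary above does not say which pooling VLM2Vec uses, and `last_token_pool`, the array shapes, and the random inputs are hypothetical choices made for this example.

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pool (batch, seq_len, dim) hidden states into (batch, dim) unit vectors.

    Assumes right-padded sequences: the last position with mask == 1 is the
    final real token of each sequence.
    """
    last_idx = attention_mask.sum(axis=1) - 1            # index of the last real token
    batch_idx = np.arange(hidden_states.shape[0])
    pooled = hidden_states[batch_idx, last_idx]          # (batch, dim)
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)          # L2-normalize each row

# Random stand-ins for model outputs (illustration only).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 8))
mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1]])
emb = last_token_pool(hidden, mask)
```

Whatever the sequence length, every input ends up as one row of `emb` with the same dimensionality, which is what makes the output usable for retrieval and classification alike.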

Core claim

VLM2Vec is a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model. Unlike CLIP or BLIP, which encode text or images independently without task instructions, VLM2Vec processes any image-text combination guided by instructions to produce a fixed-dimensional vector. When models built on Phi-3.5-V and LLaVA-1.6 are trained on the 20 training datasets of MMEB, they deliver an absolute average improvement of 10 to 20 percentage points over prior multimodal embedding models on MMEB's evaluation split, which covers both in-distribution and out-of-distribution datasets.

What carries the argument

VLM2Vec, the contrastive training procedure that adapts a vision-language model to output task-instructed embeddings from mixed image and text inputs.
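
As a rough illustration of what a contrastive objective of this kind looks like, the sketch below implements a standard in-batch InfoNCE loss with NumPy. It is an assumption that VLM2Vec uses this exact form; the function name `info_nce_loss`, the default temperature of 0.05, and the random vectors standing in for query and target embeddings are all hypothetical.

```python
import numpy as np

def info_nce_loss(queries: np.ndarray, targets: np.ndarray, temperature: float = 0.05) -> float:
    """In-batch InfoNCE: row i of `targets` is the positive for row i of `queries`;
    every other row in the batch serves as a negative."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = (q @ t.T) / temperature                      # (batch, batch) cosine / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # negative log-likelihood of the positives

# Random stand-ins for embeddings (illustration only).
rng = np.random.default_rng(0)
targets = rng.normal(size=(4, 16))
aligned = targets + 0.01 * rng.normal(size=(4, 16))   # queries close to their positives
shuffled = np.roll(targets, shift=1, axis=0)           # queries paired with the wrong targets
loss_aligned = info_nce_loss(aligned, targets)
loss_shuffled = info_nce_loss(shuffled, targets)
```

The loss is low when each query sits closest to its own target and high when the pairing is wrong, which is the pressure that pulls matching query-target pairs together in the embedding space.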

If this is right

  • Existing vision-language models can be repurposed into strong embedding models without new architecture design.
  • A single training run on the MMEB training split yields gains across classification, retrieval, visual question answering, and grounding.
  • Multimodal embedding evaluation can now use a standardized benchmark that mixes in-distribution and out-of-distribution tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same recipe could be applied to even larger VLMs to test whether scaling laws observed in language models extend to multimodal embeddings.
  • Task instructions might allow a single model to switch between embedding objectives at inference time without retraining.
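
The second extension can be pictured as a change in the prompt only. The sketch below shows one hypothetical way to assemble instruction-conditioned inputs before they reach the encoder; `EmbeddingInput`, `build_prompt`, the `<image>` placeholder, and the example strings are illustrative assumptions, not the paper's templates.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EmbeddingInput:
    instruction: str
    text: Optional[str] = None
    image_path: Optional[str] = None   # hypothetical: path to an image file

def build_prompt(item: EmbeddingInput, image_token: str = "<image>") -> str:
    """Join the instruction, an optional image placeholder, and optional text into one prompt."""
    parts = [item.instruction]
    if item.image_path is not None:
        parts.append(image_token)
    if item.text is not None:
        parts.append(item.text)
    return " ".join(parts)

caption = "A bus parked outside a train station."
retrieval = build_prompt(
    EmbeddingInput("Find an image that matches the provided caption.", text=caption)
)
classification = build_prompt(
    EmbeddingInput("Identify the scene shown in the image.", image_path="photo.jpg")
)
```

Under this reading, switching objectives means swapping the instruction string while the model weights stay fixed.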

Load-bearing premise

That contrastive training on the 20 MMEB training datasets produces embeddings that generalize to the 16 evaluation datasets without substantial overfitting or data leakage between splits.

What would settle it

Training VLM2Vec on the 20 datasets and then measuring zero or negative improvement on a fresh multimodal task never seen in MMEB would falsify the claim of broad generalization.

Original abstract

Embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering. Recently, there has been a surge of interest in developing universal text embedding models that can generalize across tasks (e.g., MTEB). However, progress in learning universal multimodal embedding models has been relatively slow despite its importance and practicality. In this work, we aim to explore the potential for building universal embeddings capable of handling a wide range of downstream tasks. Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e. classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets covering both in-distribution and out-of-distribution tasks, and (2) VLM2Vec (Vision-Language Model -> Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model via training on MMEB. Unlike previous models such as CLIP and BLIP, which encode text or images independently without any task instruction, VLM2Vec can process any combination of images and text to generate a fixed-dimensional vector based on task instructions. We build a series of VLM2Vec models on SoTA VLMs like Phi-3.5-V, LLaVA-1.6 and evaluate them on MMEB's evaluation split. Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models on both in-distribution and out-of-distribution datasets in MMEB. We show that VLMs are secretly strong embedding models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MMEB, a benchmark with 36 multimodal datasets (20 for training, 16 for evaluation) spanning classification, visual question answering, multimodal retrieval, and visual grounding, including both in-distribution and out-of-distribution tasks. It proposes VLM2Vec, a contrastive training method to convert vision-language models into embedding models that incorporate task instructions to generate embeddings from mixed image-text inputs. The key finding is that VLM2Vec achieves 10% to 20% absolute improvements over prior multimodal embedding models on the MMEB evaluation split.

Significance. If the generalization results hold, this work is significant for showing that state-of-the-art VLMs can be adapted via contrastive training into strong universal multimodal embedders that handle task instructions, going beyond independent encoding in models like CLIP. The large-scale MMEB benchmark itself is a valuable resource that could standardize evaluation in the field, analogous to MTEB for text embeddings.

major comments (2)
  1. [MMEB Benchmark Construction] MMEB construction and split description: No quantitative checks (image hashing, caption similarity, or source provenance analysis) are reported to rule out sample overlap or near-duplicates between the 20 training datasets and 16 evaluation datasets. This directly affects the load-bearing claim of generalization to out-of-distribution tasks and the interpretation of the 10-20% gains as arising from the VLM2Vec objective rather than leakage.
  2. [Experiments and Results] Experimental protocol and baselines: Insufficient detail is given on exact baseline re-implementations (e.g., whether CLIP/BLIP variants were re-trained on the same MMEB training split with identical prompts or used off-the-shelf), evaluation protocols, and contamination controls. This weakens the quantitative support for the central performance claims.
minor comments (2)
  1. [Abstract] The phrase 'VLMs are secretly strong embedding models' in the abstract is informal; a more precise statement such as 'VLMs can be effectively adapted as task-aware embedding models' would improve formality.
  2. [Results] Tables reporting average improvements should explicitly separate in-distribution and out-of-distribution results and include standard deviations or statistical tests to support the 'absolute average improvement of 10% to 20%' claim.
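
The reporting asked for in minor comment 2 could be produced along the following lines. This is a minimal sketch: the scores are placeholder numbers and not results from the paper, `summarize_by_split` is a hypothetical helper, and the standard deviation here is taken across datasets within a split (spread across random seeds would be computed the same way).

```python
from statistics import mean, stdev

def summarize_by_split(scores: dict[str, tuple[str, float]]) -> dict[str, tuple[float, float]]:
    """Map each split label ("IND" or "OOD") to (mean, sample standard deviation)."""
    by_split: dict[str, list[float]] = {}
    for split, score in scores.values():
        by_split.setdefault(split, []).append(score)
    return {s: (mean(v), stdev(v) if len(v) > 1 else 0.0) for s, v in by_split.items()}

# Placeholder numbers for illustration only; they are NOT results from the paper.
dummy_scores = {
    "dataset_a": ("IND", 60.0),
    "dataset_b": ("IND", 70.0),
    "dataset_c": ("OOD", 40.0),
    "dataset_d": ("OOD", 50.0),
}
summary = summarize_by_split(dummy_scores)
```

Reporting the two splits side by side makes it visible whether an average gain is carried mostly by in-distribution datasets.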

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments on benchmark validation and experimental transparency are well-taken and will improve the manuscript. We address each major comment below and commit to revisions that strengthen the presentation of our results without altering the core claims.

Point-by-point responses
  1. Referee: [MMEB Benchmark Construction] MMEB construction and split description: No quantitative checks (image hashing, caption similarity, or source provenance analysis) are reported to rule out sample overlap or near-duplicates between the 20 training datasets and 16 evaluation datasets. This directly affects the load-bearing claim of generalization to out-of-distribution tasks and the interpretation of the 10-20% gains as arising from the VLM2Vec objective rather than leakage.

    Authors: We acknowledge that the original manuscript did not report explicit quantitative overlap analyses. The 36 datasets were drawn from established public benchmarks and retained their original train/evaluation splits to preserve task diversity and out-of-distribution coverage. In the revised version we will add a dedicated appendix section that quantifies potential overlaps using perceptual image hashing and sentence-embedding cosine similarity between the training and evaluation partitions. Preliminary internal checks show overlap rates below 1 percent; these results will be reported to support the interpretation that the observed gains stem from the contrastive training objective rather than data leakage. revision: yes

  2. Referee: [Experiments and Results] Experimental protocol and baselines: Insufficient detail is given on exact baseline re-implementations (e.g., whether CLIP/BLIP variants were re-trained on the same MMEB training split with identical prompts or used off-the-shelf), evaluation protocols, and contamination controls. This weakens the quantitative support for the central performance claims.

    Authors: We agree that additional protocol details are required for reproducibility. All reported baselines (CLIP, BLIP, and related models) were evaluated using their publicly released checkpoints without any fine-tuning on the MMEB training split, preserving a fair comparison to prior work that does not incorporate task instructions. In the revision we will expand the experimental section and add an appendix that specifies exact prompt templates, similarity computation, batch sizes, and hardware settings. We will also include an explicit discussion of contamination controls, confirming that evaluation tasks were chosen to avoid source overlap with training data and describing the steps taken to mitigate leakage risks. revision: yes
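
The overlap analysis promised in the first response (perceptual image hashing plus cosine similarity between sentence embeddings) could look roughly like the sketch below. It uses a simple average hash written in NumPy as a stand-in for whatever perceptual hash the authors choose; `average_hash`, `hamming`, `cosine`, the 8x8 hash size, and the random arrays standing in for images are all assumptions. Sentence embeddings would come from a separate text encoder that is not shown.

```python
import numpy as np

def average_hash(gray: np.ndarray, hash_size: int = 8) -> np.ndarray:
    """Average hash of a 2-D grayscale array: block-mean downsample, then threshold at the mean."""
    h, w = gray.shape
    gray = gray[: h - h % hash_size, : w - w % hash_size]   # crop to a multiple of hash_size
    blocks = gray.reshape(hash_size, gray.shape[0] // hash_size,
                          hash_size, gray.shape[1] // hash_size).mean(axis=(1, 3))
    return (blocks > blocks.mean()).ravel()                  # boolean vector of hash_size**2 bits

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two hashes."""
    return int(np.count_nonzero(a != b))

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity, e.g. between two caption embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Random arrays standing in for grayscale images (illustration only).
rng = np.random.default_rng(0)
train_img = rng.uniform(0, 255, size=(64, 64))
near_dup = train_img + rng.normal(0, 1.0, size=(64, 64))    # same image with slight noise
unrelated = rng.uniform(0, 255, size=(64, 64))

d_dup = hamming(average_hash(train_img), average_hash(near_dup))
d_other = hamming(average_hash(train_img), average_hash(unrelated))
```

A small Hamming distance flags a likely near-duplicate image, and a cosine similarity close to 1 flags a likely duplicated caption. The thresholds used to count a pair as overlapping are a judgment call that the revised appendix would need to state.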

Circularity Check

0 steps flagged

No significant circularity: empirical results on held-out MMEB evaluation splits

Full rationale

The paper introduces the MMEB benchmark with an explicit partition into 20 training datasets and 16 distinct evaluation datasets (covering in-distribution and out-of-distribution tasks), trains VLM2Vec via contrastive learning on the training split, and reports performance metrics on the held-out evaluation split. This constitutes an independent empirical test rather than any derivation that reduces to its own inputs by construction. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the core claims; the 10-20% gains are measured against external held-out data and therefore remain falsifiable outside the training procedure.
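
To make the held-out test concrete, the sketch below scores a set of queries by ranking each query's candidates with cosine similarity and checking whether the top-ranked one is correct. Treating the metric as Precision@1 is an assumption, since this page does not name the metric; `precision_at_1` and the toy vectors are hypothetical.

```python
import numpy as np

def precision_at_1(query_emb: np.ndarray, cand_emb: np.ndarray, gold: np.ndarray) -> float:
    """query_emb: (n, d); cand_emb: (n, k, d) candidates per query;
    gold: (n,) index of the correct candidate for each query."""
    q = query_emb / np.linalg.norm(query_emb, axis=-1, keepdims=True)
    c = cand_emb / np.linalg.norm(cand_emb, axis=-1, keepdims=True)
    sims = np.einsum("nd,nkd->nk", q, c)          # cosine similarity for each candidate
    return float(np.mean(sims.argmax(axis=1) == gold))

# Toy embeddings (illustration only).
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
candidates = np.array([
    [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]],   # best match for query 0 is index 0
    [[1.0, 0.0], [0.2, 0.8], [0.0, -1.0]],   # best match for query 1 is index 1
])
score = precision_at_1(queries, candidates, gold=np.array([0, 1]))
score_wrong = precision_at_1(queries, candidates, gold=np.array([0, 0]))
```

Because the gold labels come from the evaluation datasets rather than from the model, the score can fall as easily as rise, which is why the audit treats the result as falsifiable.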

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard contrastive learning assumptions and the construction of a new benchmark; no new physical entities or ad-hoc constants are introduced.

free parameters (1)
  • contrastive temperature
    Standard hyperparameter in contrastive objectives that is typically tuned on validation data.
axioms (1)
  • domain assumption: Contrastive loss on task-instructed multimodal inputs produces useful fixed-dimensional embeddings
    Invoked when the authors convert VLMs into embedders via contrastive training on MMEB.
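
For orientation, the role of the temperature can be read off a standard InfoNCE objective. The form below is a common textbook version and is assumed here rather than taken from the paper; $q_i$ and $t_i$ denote the embeddings of the $i$-th query and its matching target in a batch of size $N$, and $\tau$ is the contrastive temperature.

```latex
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N}
  \log \frac{\exp\!\left(\cos(q_i, t_i) / \tau\right)}
            {\sum_{j=1}^{N} \exp\!\left(\cos(q_i, t_j) / \tau\right)}
```

A smaller $\tau$ sharpens the softmax over candidates, so the loss concentrates on the hardest negatives; a larger $\tau$ spreads the penalty more evenly across the batch.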

pith-pipeline@v0.9.0 · 5627 in / 1308 out tokens · 48224 ms · 2026-05-17T21:14:56.020911+00:00 · methodology


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.

  2. Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

    cs.CL 2026-05 unverdicted novelty 7.0

    Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a tra...

  3. jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

    cs.CL 2026-05 unverdicted novelty 7.0

    Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.

  4. MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

    cs.IR 2026-04 unverdicted novelty 7.0

    MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

  5. mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.

  6. Bottleneck Tokens for Unified Multimodal Retrieval

    cs.LG 2026-04 unverdicted novelty 7.0

    Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.

  7. Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.

  8. MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL

    cs.IR 2026-04 unverdicted novelty 7.0

    MARVEL reaches 37.9 nDCG@10 on the MM-BRIGHT benchmark by combining LLM query expansion, a reasoning-enhanced dense retriever, and GPT-4o CoT reranking, beating prior multimodal encoders by 10.3 points.

  9. PLUME: Latent Reasoning Based Universal Multimodal Embedding

    cs.CV 2026-04 unverdicted novelty 7.0

    PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.

  10. Adapting MLLMs for Nuanced Video Retrieval

    cs.CV 2025-12 unverdicted novelty 7.0

    Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.

  11. jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

    cs.CL 2026-05 unverdicted novelty 6.0

    GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text ...

  12. Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings

    cs.CV 2026-04 unverdicted novelty 6.0

    Rewrite-driven generation with alignment and RL produces shorter, more effective generative multimodal embeddings than CoT methods on retrieval benchmarks.

  13. HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval

    cs.IR 2026-04 unverdicted novelty 6.0

    HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and...

  14. CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding

    cs.CL 2026-01 unverdicted novelty 6.0

    CausalEmbed uses auto-regressive generation with iterative margin loss to produce multi-vector embeddings that reduce visual token counts 30-155x while retaining competitive performance on VDR benchmarks.

  15. EmbeddingGemma: Powerful and Lightweight Text Representations

    cs.CL 2025-09 unverdicted novelty 6.0

    A 300M-parameter open embedding model sets new SOTA on MTEB for its size class and matches models twice as large while staying effective when compressed.

  16. MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction

    cs.IR 2025-09 unverdicted novelty 6.0

    MetaEmbed trains fixed learnable Meta Tokens to produce granularity-organized multi-vector embeddings that support test-time scaling in multimodal retrieval.

  17. Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval

    cs.CV 2026-04 unverdicted novelty 5.0

    SSA-ME uses saliency-aware modeling to reduce visual neglect and semantic drift, achieving SOTA results on the MMEB benchmark for multimodal retrieval.

  18. BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

    cs.IR 2026-04 unverdicted novelty 5.0

    BRIDGE reaches 29.7 nDCG@10 on MM-BRIGHT by RL-aligning multimodal queries to text and using a reasoning retriever, beating multimodal encoders and, when combined with Nomic-Vision, exceeding the best text-only retrie...

  19. Attention Grounded Enhancement for Visual Document Retrieval

    cs.IR 2025-11 unverdicted novelty 5.0

    AGREE boosts visual document retrieval by adding local relevance signals from MLLM attention maps to global document labels during retriever training.

  20. VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

    cs.CV 2025-07 unverdicted novelty 5.0

    VLM2Vec-V2 is a multimodal embedding model trained on an extended MMEB-V2 benchmark that adds video and visual document tasks and reports gains on both new and prior image benchmarks.
