pith. machine review for the scientific record.

arxiv: 2412.16855 · v2 · submitted 2024-12-22 · 💻 cs.CL · cs.IR

Recognition: 2 Lean theorem links

GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:31 UTC · model grok-4.3

classification: 💻 cs.CL · cs.IR
keywords: universal multimodal retrieval · multimodal large language models · fused-modal training data · dense retriever · data synthesis pipeline · UMR benchmark · cross-modal search · synthetic dataset

The pith

Training an MLLM on synthetically balanced fused text-image data produces a single dense retriever that leads on universal multimodal search tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that multimodal large language models can serve as universal retrievers when given diverse training examples that combine text and images, rather than text alone. Earlier attempts stayed limited because available multimodal data was heavily skewed toward one modality, so the authors built a synthesis pipeline to create a large, balanced fused-modal dataset. They then train the General Multimodal Embedder on this data and release a new benchmark covering pure text, pure image, and mixed queries. A sympathetic reader would care because a single model could replace separate text and image search systems and handle queries that mix modalities without retraining or reindexing.
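What this pipeline looks like mechanically is not spelled out here, but its shape is familiar from doc2query-style synthesis: sample a candidate, have a generator model write a matching query, then filter for quality. A hypothetical skeleton, with generate and relevance_score standing in for whatever MLLM calls the authors actually use:

    # Hypothetical skeleton of a fused-modal data-synthesis loop; `generate`
    # and `relevance_score` stand in for unspecified MLLM calls.
    from typing import Callable

    def synthesize_pairs(candidates: list[dict],
                         generate: Callable[[dict], dict],
                         relevance_score: Callable[[dict, dict], float],
                         threshold: float = 0.8) -> list[tuple[dict, dict]]:
        """candidates: docs that may hold text, an image path, or both.
        Returns (query, candidate) pairs that pass a quality filter."""
        pairs = []
        for cand in candidates:
            query = generate(cand)                  # e.g. a fused text+image query
            if relevance_score(query, cand) >= threshold:
                pairs.append((query, cand))
        return pairs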

Core claim

The General Multimodal Embedder is an MLLM turned into a dense retriever by training it on a large-scale synthetic fused-modal dataset constructed through a dedicated synthesis pipeline; this training regime lifts performance to state-of-the-art levels on the new Universal Multimodal Retrieval Benchmark across text-only, image-only, and mixed-modality query-candidate pairs.

What carries the argument

The General Multimodal Embedder (GME), an MLLM-based dense retriever whose embeddings are learned from the authors' synthetically generated fused-modal training set to support retrieval regardless of whether queries and candidates are text, images, or combinations.
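A hedged sketch of the standard mechanism this implies, not the authors' released code: an MLLM-based dense retriever is typically trained with an in-batch contrastive (InfoNCE) objective over query-candidate embedding pairs, and the paper's references to contrastive learning point the same way. The dimensions and temperature below are placeholder values.

    # Minimal InfoNCE sketch for training a fused-modal dense retriever.
    # Embeddings would come from the MLLM's final hidden state, whether the
    # input was text, an image, or both; everything here is illustrative.
    import torch
    import torch.nn.functional as F

    def info_nce_loss(query_emb: torch.Tensor,
                      cand_emb: torch.Tensor,
                      temperature: float = 0.05) -> torch.Tensor:
        """query_emb, cand_emb: (batch, dim); row i of each forms a positive
        pair, and all other rows in the batch act as negatives."""
        q = F.normalize(query_emb, dim=-1)
        c = F.normalize(cand_emb, dim=-1)
        logits = q @ c.T / temperature                      # (batch, batch) similarities
        labels = torch.arange(q.size(0), device=q.device)   # diagonal = positives
        return F.cross_entropy(logits, labels)

    # Toy call with random 512-d embeddings and batch size 8.
    loss = info_nce_loss(torch.randn(8, 512), torch.randn(8, 512))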

If this is right

  • A single embedding space now supports retrieval when the query is text, an image, or both, and the candidate set can be any of the same combinations (see the serving sketch after this list).
  • Model scaling and careful choice of training strategy continue to raise accuracy on the UMR benchmark once the fused data is available.
  • Ablation results isolate the contribution of data diversity and show that removing the synthesis step drops performance back toward prior text-only baselines.
  • The new UMRB provides a standardized test bed that future universal retrievers can be measured against directly.
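To make the first point concrete, a minimal serving sketch, assumed rather than taken from the paper: once every candidate, whatever its modality, is embedded into the shared space, a single nearest-neighbor search answers any query type.

    # One index for text, image, and fused candidates; names are illustrative.
    import numpy as np

    def search(query_emb: np.ndarray, index: np.ndarray, k: int = 5) -> np.ndarray:
        """index: (n_candidates, dim) L2-normalized embeddings of mixed-modality
        documents; returns the indices of the top-k by cosine similarity."""
        q = query_emb / np.linalg.norm(query_emb)
        scores = index @ q
        return np.argsort(-scores)[:k]

    index = np.random.randn(1000, 512)
    index /= np.linalg.norm(index, axis=1, keepdims=True)
    top = search(np.random.randn(512), index)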

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar synthesis pipelines could be applied to add video or audio modalities without requiring massive new human annotation.
  • Real-world search engines might adopt one index instead of maintaining separate text and vision indexes, lowering storage and maintenance costs.
  • The same training recipe could be tested on open-ended multimodal question answering to check whether retrieval gains translate to generation tasks.

Load-bearing premise

The synthetic fused-modal training dataset is of high quality and sufficiently diverse to unlock MLLM potential for universal retrieval without introducing biases or artifacts.

What would settle it

If an identically sized MLLM trained on an equal volume of real, balanced multimodal data instead of the synthetic set achieves equal or higher accuracy on the UMRB, the necessity of the synthesis pipeline for the claimed gains would be refuted.
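Such a head-to-head comparison would be scored with a standard graded retrieval metric; nDCG@10 is the one reported across the surrounding UMR literature. A small self-contained sketch, with illustrative relevance grades:

    # nDCG@k over graded relevance judgments; the inputs are illustrative.
    import numpy as np

    def ndcg_at_k(ranked_rels, all_rels, k=10):
        """ranked_rels: relevance grades of retrieved docs in rank order;
        all_rels: grades of every judged doc, used for the ideal ranking."""
        rels = np.asarray(ranked_rels[:k], dtype=float)
        dcg = ((2 ** rels - 1) / np.log2(np.arange(2, rels.size + 2))).sum()
        ideal = np.sort(np.asarray(all_rels, dtype=float))[::-1][:k]
        idcg = ((2 ** ideal - 1) / np.log2(np.arange(2, ideal.size + 2))).sum()
        return float(dcg / idcg) if idcg > 0 else 0.0

    # Same metric, two identically sized models, same UMRB split: a higher
    # score under real balanced data would undercut the synthesis pipeline.
    print(ndcg_at_k([3, 2, 0, 1], [3, 3, 2, 1, 0]))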

Original abstract

Universal Multimodal Retrieval (UMR) aims to enable search across various modalities using a unified model, where queries and candidates can consist of pure text, images, or a combination of both. Previous work has attempted to adopt multimodal large language models (MLLMs) to realize UMR using only text data. However, our preliminary experiments demonstrate that more diverse multimodal training data can further unlock the potential of MLLMs. Despite its effectiveness, the existing multimodal training data is highly imbalanced in terms of modality, which motivates us to develop a training data synthesis pipeline and construct a large-scale, high-quality fused-modal training dataset. Based on the synthetic training data, we develop the General Multimodal Embedder (GME), an MLLM-based dense retriever designed for UMR. Furthermore, we construct a comprehensive UMR Benchmark (UMRB) to evaluate the effectiveness of our approach. Experimental results show that our method achieves state-of-the-art performance among existing UMR methods. Last, we provide in-depth analyses of model scaling and training strategies, and perform ablation studies on both the model and synthetic data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes the General Multimodal Embedder (GME), an MLLM-based dense retriever for universal multimodal retrieval (UMR) that supports text, image, and fused-modal queries/candidates. To overcome modality imbalance in prior training data, the authors introduce a synthesis pipeline that constructs a large-scale fused-modal dataset; they also release the UMR Benchmark (UMRB) and report that GME attains state-of-the-art results on it, supported by scaling studies, training-strategy analyses, and ablations.

Significance. If the synthetic-data quality and experimental superiority hold, the work would meaningfully advance UMR by showing that carefully balanced multimodal training data can better exploit MLLM capacity for cross-modal retrieval, providing both a practical model and a new evaluation benchmark.

major comments (1)
  1. [§3 Dataset Synthesis Pipeline; §4 Experiments] The central SOTA claim on UMRB rests on the assumption that the synthetic fused-modal dataset is high-quality, balanced, and free of systematic artifacts or hallucinations; however, the manuscript reports no independent quantitative checks (e.g., modality-balance statistics, diversity metrics, or human validation of generated pairs) that would confirm this assumption, leaving the performance gains vulnerable to data-induced bias.
minor comments (2)
  1. [Abstract; §4] Experimental details on exact baselines, evaluation metrics, statistical-significance testing, and hyper-parameter settings are only sketched; these should be expanded with concrete numbers and tables for reproducibility.
  2. [§5 Ablations] The scaling and training-strategy analyses would benefit from clearer notation distinguishing the contributions of data volume versus data-modality diversity.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the major comment point by point below.

Point-by-point responses
  1. Referee: [§3 Dataset Synthesis Pipeline; §4 Experiments] The central SOTA claim on UMRB rests on the assumption that the synthetic fused-modal dataset is high-quality, balanced, and free of systematic artifacts or hallucinations; however, the manuscript reports no independent quantitative checks (e.g., modality-balance statistics, diversity metrics, or human validation of generated pairs) that would confirm this assumption, leaving the performance gains vulnerable to data-induced bias.

    Authors: We thank the referee for this important observation. The synthesis pipeline in §3 incorporates explicit balancing steps (equal sampling across text-only, image-only, and fused-modal pairs) and quality filters (heuristic length checks plus MLLM-based relevance scoring) to mitigate imbalance and hallucinations. However, we acknowledge that the original manuscript did not report standalone quantitative validation metrics for the final dataset. In the revised version we will add: (i) modality-balance statistics (exact counts and percentages of each modality combination), (ii) diversity metrics (average token length, unique n-gram coverage, and average pairwise embedding cosine similarity), and (iii) a human validation study on a random 500-pair subset reporting hallucination and relevance rates. These results will appear in §3 with supporting tables and examples moved to the appendix. We believe the added evidence will directly address the concern while preserving the experimental claims. revision: yes
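For concreteness, the checks promised in (i)-(iii) could be implemented along these lines; the record schema, function names, and defaults are our illustration, not the authors':

    # Dataset-validation sketches: modality balance, n-gram diversity, and
    # mean pairwise embedding similarity. All names here are hypothetical.
    from collections import Counter
    import numpy as np

    def modality_balance(records):
        """records: [{'query_modality': 'text' | 'image' | 'fused'}, ...]"""
        counts = Counter(r["query_modality"] for r in records)
        total = sum(counts.values())
        return {m: n / total for m, n in counts.items()}

    def unique_ngram_coverage(texts, n=3):
        grams, total = set(), 0
        for t in texts:
            toks = t.split()
            for i in range(len(toks) - n + 1):
                grams.add(tuple(toks[i:i + n]))
                total += 1
        return len(grams) / max(total, 1)    # 1.0 = every n-gram is unique

    def mean_pairwise_cosine(emb):
        """emb: (n, dim) with n > 1; lower values indicate more diversity."""
        e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        sims = e @ e.T
        n = e.shape[0]
        return float((sims.sum() - n) / (n * (n - 1)))  # drop self-similarity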

Circularity Check

0 steps flagged

No significant circularity; the empirical training and evaluation chain is self-contained.

Full rationale

The paper constructs a synthetic fused-modal dataset to address modality imbalance in existing data, trains the GME MLLM-based retriever on it, builds the UMRB benchmark, and reports SOTA empirical results. No derivation step reduces by construction to its inputs, no fitted parameter is relabeled as a prediction, and no load-bearing claim rests on self-citation chains or imported uniqueness theorems. All central results derive from standard training-plus-evaluation on held-out benchmarks rather than tautological re-expression of the synthesis pipeline or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim depends on the assumption that synthetic data can effectively substitute for real data and that MLLMs are suitable for embedding-based retrieval.

axioms (1)
  • domain assumption Diverse multimodal training data improves MLLM performance on UMR tasks
    Based on preliminary experiments mentioned in the abstract.

pith-pipeline@v0.9.0 · 5516 in / 1211 out tokens · 48454 ms · 2026-05-15T06:31:17.683583+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.

  2. jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

    cs.CL 2026-05 unverdicted novelty 7.0

    Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.

  3. MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

    cs.IR 2026-04 unverdicted novelty 7.0

    MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

  4. CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval

    cs.SE 2026-04 unverdicted novelty 7.0

    CodeMMR creates a unified embedding space for text, code, and images, outperforming baselines by 10 nDCG@10 points and boosting RAG code generation quality.

  5. SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    SLQ turns frozen MLLMs into retrievers via shared latent queries appended to inputs, outperforming fine-tuning on COCO and Flickr30K while introducing KARR-Bench for knowledge-aware evaluation.

  6. Bottleneck Tokens for Unified Multimodal Retrieval

    cs.LG 2026-04 unverdicted novelty 7.0

    Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.

  7. Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.

  8. MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL

    cs.IR 2026-04 unverdicted novelty 7.0

    MARVEL reaches 37.9 nDCG@10 on the MM-BRIGHT benchmark by combining LLM query expansion, a reasoning-enhanced dense retriever, and GPT-4o CoT reranking, beating prior multimodal encoders by 10.3 points.

  9. Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    Introduces the OACIR task requiring instance-level consistency via bounding-box anchors, a 160K real-world benchmark OACIRR, and the AdaFocal framework that adaptively focuses attention on the anchored region.

  10. PLUME: Latent Reasoning Based Universal Multimodal Embedding

    cs.CV 2026-04 unverdicted novelty 7.0

    PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.

  11. jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

    cs.CL 2026-05 unverdicted novelty 6.0

    GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text ...

  12. DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

    cs.CV 2026-04 unverdicted novelty 6.0

    A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...

  13. Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings

    cs.CV 2026-04 unverdicted novelty 6.0

    Rewrite-driven generation with alignment and RL produces shorter, more effective generative multimodal embeddings than CoT methods on retrieval benchmarks.

  14. MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment

    cs.CV 2026-04 unverdicted novelty 6.0

    MiMIC mitigates visual modality collapse and semantic misalignment in universal multimodal retrieval via fusion-in-decoder architecture and robust single-modality training.

  15. SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

    cs.CV 2026-04 conditional novelty 6.0

    SLQ adapts frozen MLLMs for multimodal retrieval by appending shared latent queries to text and image tokens and introduces KARR-Bench to test knowledge-aware reasoning retrieval.

  16. ViLL-E: Video LLM Embeddings for Retrieval

    cs.CV 2026-04 unverdicted novelty 6.0

    ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.

  17. HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval

    cs.IR 2026-04 unverdicted novelty 6.0

    HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and...

  18. A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval

    cs.CV 2026-05 conditional novelty 5.0

    Single-vector aggregation in visual financial document retrieval collapses semantically distinct documents due to global texture dominance, as demonstrated by a new diagnostic benchmark where patch-level signals detec...

  19. TriAlignGR: Triangular Multitask Alignment with Multimodal Deep Interest Mining for Generative Recommendation

    cs.IR 2026-05 unverdicted novelty 5.0

    TriAlignGR integrates visual content and latent user interests into Semantic IDs via cross-modal alignment, CoT-based interest mining, and triangular multitask training to address content degradation and semantic opac...

  20. Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval

    cs.CV 2026-04 unverdicted novelty 5.0

    SSA-ME uses saliency-aware modeling to reduce visual neglect and semantic drift, achieving SOTA results on the MMEB benchmark for multimodal retrieval.

  21. BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

    cs.IR 2026-04 unverdicted novelty 5.0

    BRIDGE reaches 29.7 nDCG@10 on MM-BRIGHT by RL-aligning multimodal queries to text and using a reasoning retriever, beating multimodal encoders and, when combined with Nomic-Vision, exceeding the best text-only retrie...

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · cited by 19 Pith papers · 8 internal anchors
