pith. machine review for the scientific record.

arxiv: 2412.16855 · v2 · submitted 2024-12-22 · 💻 cs.CL · cs.IR

Recognition: 2 Lean theorem links

GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:31 UTC · model grok-4.3

classification: 💻 cs.CL · cs.IR
keywords: universal multimodal retrieval · multimodal large language models · fused-modal training data · dense retriever · data synthesis pipeline · UMR benchmark · cross-modal search · synthetic dataset

The pith

Training an MLLM on synthetically balanced fused text-image data produces a single dense retriever that leads on universal multimodal search tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that multimodal large language models can serve as universal retrievers when given diverse training examples that combine text and images, rather than text alone. Earlier attempts stayed limited because available multimodal data was heavily skewed toward one modality, so the authors built a synthesis pipeline to create a large, balanced fused-modal dataset. They then train the General Multimodal Embedder on this data and release a new benchmark covering pure text, pure image, and mixed queries. A sympathetic reader would care because a single model could replace separate text and image search systems and handle queries that mix modalities without retraining or reindexing.
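What this pipeline looks like mechanically is not spelled out here, but its shape is familiar from doc2query-style synthesis: sample a candidate, have a generator model write a matching query, then filter for quality. A hypothetical skeleton, with generate and relevance_score standing in for whatever MLLM calls the authors actually use:

    # Hypothetical skeleton of a fused-modal data-synthesis loop; `generate`
    # and `relevance_score` stand in for unspecified MLLM calls.
    from typing import Callable

    def synthesize_pairs(candidates: list[dict],
                         generate: Callable[[dict], dict],
                         relevance_score: Callable[[dict, dict], float],
                         threshold: float = 0.8) -> list[tuple[dict, dict]]:
        """candidates: docs that may hold text, an image path, or both.
        Returns (query, candidate) pairs that pass a quality filter."""
        pairs = []
        for cand in candidates:
            query = generate(cand)                  # e.g. a fused text+image query
            if relevance_score(query, cand) >= threshold:
                pairs.append((query, cand))
        return pairs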

Core claim

The General Multimodal Embedder is an MLLM turned into a dense retriever by training it on a large-scale synthetic fused-modal dataset constructed through a dedicated synthesis pipeline; this training regime lifts performance to state-of-the-art levels on the new Universal Multimodal Retrieval Benchmark across text-only, image-only, and mixed-modality query-candidate pairs.

What carries the argument

The General Multimodal Embedder (GME), an MLLM-based dense retriever whose embeddings are learned from the authors' synthetically generated fused-modal training set to support retrieval regardless of whether queries and candidates are text, images, or combinations.
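A hedged sketch of the standard mechanism this implies, not the authors' released code: an MLLM-based dense retriever is typically trained with an in-batch contrastive (InfoNCE) objective over query-candidate embedding pairs, and the paper's references to contrastive learning point the same way. The dimensions and temperature below are placeholder values.

    # Minimal InfoNCE sketch for training a fused-modal dense retriever.
    # Embeddings would come from the MLLM's final hidden state, whether the
    # input was text, an image, or both; everything here is illustrative.
    import torch
    import torch.nn.functional as F

    def info_nce_loss(query_emb: torch.Tensor,
                      cand_emb: torch.Tensor,
                      temperature: float = 0.05) -> torch.Tensor:
        """query_emb, cand_emb: (batch, dim); row i of each forms a positive
        pair, and all other rows in the batch act as negatives."""
        q = F.normalize(query_emb, dim=-1)
        c = F.normalize(cand_emb, dim=-1)
        logits = q @ c.T / temperature                      # (batch, batch) similarities
        labels = torch.arange(q.size(0), device=q.device)   # diagonal = positives
        return F.cross_entropy(logits, labels)

    # Toy call with random 512-d embeddings and batch size 8.
    loss = info_nce_loss(torch.randn(8, 512), torch.randn(8, 512))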

If this is right

  • A single embedding space now supports retrieval when the query is text, an image, or both, and the candidate set can be any of the same combinations (see the serving sketch after this list).
  • Model scaling and careful choice of training strategy continue to raise accuracy on the UMR benchmark once the fused data is available.
  • Ablation results isolate the contribution of data diversity and show that removing the synthesis step drops performance back toward prior text-only baselines.
  • The new UMRB provides a standardized test bed that future universal retrievers can be measured against directly.
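To make the first point concrete, a minimal serving sketch, assumed rather than taken from the paper: once every candidate, whatever its modality, is embedded into the shared space, a single nearest-neighbor search answers any query type.

    # One index for text, image, and fused candidates; names are illustrative.
    import numpy as np

    def search(query_emb: np.ndarray, index: np.ndarray, k: int = 5) -> np.ndarray:
        """index: (n_candidates, dim) L2-normalized embeddings of mixed-modality
        documents; returns the indices of the top-k by cosine similarity."""
        q = query_emb / np.linalg.norm(query_emb)
        scores = index @ q
        return np.argsort(-scores)[:k]

    index = np.random.randn(1000, 512)
    index /= np.linalg.norm(index, axis=1, keepdims=True)
    top = search(np.random.randn(512), index)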

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar synthesis pipelines could be applied to add video or audio modalities without requiring massive new human annotation.
  • Real-world search engines might adopt one index instead of maintaining separate text and vision indexes, lowering storage and maintenance costs.
  • The same training recipe could be tested on open-ended multimodal question answering to check whether retrieval gains translate to generation tasks.

Load-bearing premise

The synthetic fused-modal training dataset is of high quality and sufficiently diverse to unlock MLLM potential for universal retrieval without introducing biases or artifacts.

What would settle it

If an identically sized MLLM trained on an equal volume of real, balanced multimodal data instead of the synthetic set achieves equal or higher accuracy on the UMRB, the necessity of the synthesis pipeline for the claimed gains would be refuted.
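Such a head-to-head comparison would be scored with a standard graded retrieval metric; nDCG@10 is the one reported across the surrounding UMR literature. A small self-contained sketch, with illustrative relevance grades:

    # nDCG@k over graded relevance judgments; the inputs are illustrative.
    import numpy as np

    def ndcg_at_k(ranked_rels, all_rels, k=10):
        """ranked_rels: relevance grades of retrieved docs in rank order;
        all_rels: grades of every judged doc, used for the ideal ranking."""
        rels = np.asarray(ranked_rels[:k], dtype=float)
        dcg = ((2 ** rels - 1) / np.log2(np.arange(2, rels.size + 2))).sum()
        ideal = np.sort(np.asarray(all_rels, dtype=float))[::-1][:k]
        idcg = ((2 ** ideal - 1) / np.log2(np.arange(2, ideal.size + 2))).sum()
        return float(dcg / idcg) if idcg > 0 else 0.0

    # Same metric, two identically sized models, same UMRB split: a higher
    # score under real balanced data would undercut the synthesis pipeline.
    print(ndcg_at_k([3, 2, 0, 1], [3, 3, 2, 1, 0]))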

Original abstract

Universal Multimodal Retrieval (UMR) aims to enable search across various modalities using a unified model, where queries and candidates can consist of pure text, images, or a combination of both. Previous work has attempted to adopt multimodal large language models (MLLMs) to realize UMR using only text data. However, our preliminary experiments demonstrate that more diverse multimodal training data can further unlock the potential of MLLMs. Despite its effectiveness, the existing multimodal training data is highly imbalanced in terms of modality, which motivates us to develop a training data synthesis pipeline and construct a large-scale, high-quality fused-modal training dataset. Based on the synthetic training data, we develop the General Multimodal Embedder (GME), an MLLM-based dense retriever designed for UMR. Furthermore, we construct a comprehensive UMR Benchmark (UMRB) to evaluate the effectiveness of our approach. Experimental results show that our method achieves state-of-the-art performance among existing UMR methods. Last, we provide in-depth analyses of model scaling and training strategies, and perform ablation studies on both the model and synthetic data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes the General Multimodal Embedder (GME), an MLLM-based dense retriever for universal multimodal retrieval (UMR) that supports text, image, and fused-modal queries/candidates. To overcome modality imbalance in prior training data, the authors introduce a synthesis pipeline that constructs a large-scale fused-modal dataset; they also release the UMR Benchmark (UMRB) and report that GME attains state-of-the-art results on it, supported by scaling studies, training-strategy analyses, and ablations.

Significance. If the synthetic-data quality and experimental superiority hold, the work would meaningfully advance UMR by showing that carefully balanced multimodal training data can better exploit MLLM capacity for cross-modal retrieval, providing both a practical model and a new evaluation benchmark.

major comments (1)
  1. [§3 Dataset Synthesis Pipeline; §4 Experiments] The central SOTA claim on UMRB rests on the assumption that the synthetic fused-modal dataset is high-quality, balanced, and free of systematic artifacts or hallucinations; however, the manuscript reports no independent quantitative checks (e.g., modality-balance statistics, diversity metrics, or human validation of generated pairs) that would confirm this assumption, leaving the performance gains vulnerable to data-induced bias.
minor comments (2)
  1. [Abstract; §4] Experimental details on exact baselines, evaluation metrics, statistical-significance testing, and hyper-parameter settings are only sketched; these should be expanded with concrete numbers and tables for reproducibility.
  2. [§5 Ablations] The scaling and training-strategy analyses would benefit from clearer notation distinguishing the contributions of data volume versus data-modality diversity.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the major comment point by point below.

Point-by-point responses
  1. Referee: [§3 Dataset Synthesis Pipeline; §4 Experiments] The central SOTA claim on UMRB rests on the assumption that the synthetic fused-modal dataset is high-quality, balanced, and free of systematic artifacts or hallucinations; however, the manuscript reports no independent quantitative checks (e.g., modality-balance statistics, diversity metrics, or human validation of generated pairs) that would confirm this assumption, leaving the performance gains vulnerable to data-induced bias.

    Authors: We thank the referee for this important observation. The synthesis pipeline in §3 incorporates explicit balancing steps (equal sampling across text-only, image-only, and fused-modal pairs) and quality filters (heuristic length checks plus MLLM-based relevance scoring) to mitigate imbalance and hallucinations. However, we acknowledge that the original manuscript did not report standalone quantitative validation metrics for the final dataset. In the revised version we will add: (i) modality-balance statistics (exact counts and percentages of each modality combination), (ii) diversity metrics (average token length, unique n-gram coverage, and average pairwise embedding cosine similarity), and (iii) a human validation study on a random 500-pair subset reporting hallucination and relevance rates. These results will appear in §3 with supporting tables and examples moved to the appendix. We believe the added evidence will directly address the concern while preserving the experimental claims. revision: yes
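For concreteness, the checks promised in (i)-(iii) could be implemented along these lines; the record schema, function names, and defaults are our illustration, not the authors':

    # Dataset-validation sketches: modality balance, n-gram diversity, and
    # mean pairwise embedding similarity. All names here are hypothetical.
    from collections import Counter
    import numpy as np

    def modality_balance(records):
        """records: [{'query_modality': 'text' | 'image' | 'fused'}, ...]"""
        counts = Counter(r["query_modality"] for r in records)
        total = sum(counts.values())
        return {m: n / total for m, n in counts.items()}

    def unique_ngram_coverage(texts, n=3):
        grams, total = set(), 0
        for t in texts:
            toks = t.split()
            for i in range(len(toks) - n + 1):
                grams.add(tuple(toks[i:i + n]))
                total += 1
        return len(grams) / max(total, 1)    # 1.0 = every n-gram is unique

    def mean_pairwise_cosine(emb):
        """emb: (n, dim) with n > 1; lower values indicate more diversity."""
        e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        sims = e @ e.T
        n = e.shape[0]
        return float((sims.sum() - n) / (n * (n - 1)))  # drop self-similarity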

Circularity Check

0 steps flagged

No significant circularity; the empirical training and evaluation chain is self-contained.

Full rationale

The paper constructs a synthetic fused-modal dataset to address modality imbalance in existing data, trains the GME MLLM-based retriever on it, builds the UMRB benchmark, and reports SOTA empirical results. No derivation step reduces by construction to its inputs, no fitted parameter is relabeled as a prediction, and no load-bearing claim rests on self-citation chains or imported uniqueness theorems. All central results derive from standard training-plus-evaluation on held-out benchmarks rather than tautological re-expression of the synthesis pipeline or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim depends on the assumption that synthetic data can effectively substitute for real data and that MLLMs are suitable for embedding-based retrieval.

axioms (1)
  • domain assumption Diverse multimodal training data improves MLLM performance on UMR tasks
    Based on preliminary experiments mentioned in the abstract.

pith-pipeline@v0.9.0 · 5516 in / 1211 out tokens · 48454 ms · 2026-05-15T06:31:17.683583+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.

  2. jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

    cs.CL 2026-05 unverdicted novelty 7.0

    Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.

  3. MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

    cs.IR 2026-04 unverdicted novelty 7.0

    MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

  4. CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval

    cs.SE 2026-04 unverdicted novelty 7.0

    CodeMMR creates a unified embedding space for text, code, and images, outperforming baselines by 10 nDCG@10 points and boosting RAG code generation quality.

  5. SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    SLQ turns frozen MLLMs into retrievers via shared latent queries appended to inputs, outperforming fine-tuning on COCO and Flickr30K while introducing KARR-Bench for knowledge-aware evaluation.

  6. Bottleneck Tokens for Unified Multimodal Retrieval

    cs.LG 2026-04 unverdicted novelty 7.0

    Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.

  7. Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.

  8. MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL

    cs.IR 2026-04 unverdicted novelty 7.0

    MARVEL reaches 37.9 nDCG@10 on the MM-BRIGHT benchmark by combining LLM query expansion, a reasoning-enhanced dense retriever, and GPT-4o CoT reranking, beating prior multimodal encoders by 10.3 points.

  9. Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    Introduces the OACIR task requiring instance-level consistency via bounding-box anchors, a 160K real-world benchmark OACIRR, and the AdaFocal framework that adaptively focuses attention on the anchored region.

  10. PLUME: Latent Reasoning Based Universal Multimodal Embedding

    cs.CV 2026-04 unverdicted novelty 7.0

    PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.

  11. jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

    cs.CL 2026-05 unverdicted novelty 6.0

    GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text ...

  12. DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

    cs.CV 2026-04 unverdicted novelty 6.0

    A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...

  13. Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings

    cs.CV 2026-04 unverdicted novelty 6.0

    Rewrite-driven generation with alignment and RL produces shorter, more effective generative multimodal embeddings than CoT methods on retrieval benchmarks.

  14. MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment

    cs.CV 2026-04 unverdicted novelty 6.0

    MiMIC mitigates visual modality collapse and semantic misalignment in universal multimodal retrieval via fusion-in-decoder architecture and robust single-modality training.

  15. SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

    cs.CV 2026-04 conditional novelty 6.0

    SLQ adapts frozen MLLMs for multimodal retrieval by appending shared latent queries to text and image tokens and introduces KARR-Bench to test knowledge-aware reasoning retrieval.

  16. ViLL-E: Video LLM Embeddings for Retrieval

    cs.CV 2026-04 unverdicted novelty 6.0

    ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.

  17. HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval

    cs.IR 2026-04 unverdicted novelty 6.0

    HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and...

  18. A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval

    cs.CV 2026-05 conditional novelty 5.0

    Single-vector aggregation in visual financial document retrieval collapses semantically distinct documents due to global texture dominance, as demonstrated by a new diagnostic benchmark where patch-level signals detec...

  19. TriAlignGR: Triangular Multitask Alignment with Multimodal Deep Interest Mining for Generative Recommendation

    cs.IR 2026-05 unverdicted novelty 5.0

    TriAlignGR integrates visual content and latent user interests into Semantic IDs via cross-modal alignment, CoT-based interest mining, and triangular multitask training to address content degradation and semantic opac...

  20. Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval

    cs.CV 2026-04 unverdicted novelty 5.0

    SSA-ME uses saliency-aware modeling to reduce visual neglect and semantic drift, achieving SOTA results on the MMEB benchmark for multimodal retrieval.

  21. BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

    cs.IR 2026-04 unverdicted novelty 5.0

    BRIDGE reaches 29.7 nDCG@10 on MM-BRIGHT by RL-aligning multimodal queries to text and using a reasoning retriever, beating multimodal encoders and, when combined with Nomic-Vision, exceeding the best text-only retrie...

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · cited by 19 Pith papers · 8 internal anchors
