
arxiv: 2606.04240 · v1 · pith:WR52RLEOnew · submitted 2026-06-02 · 💻 cs.CV · cs.AI · cs.CL

Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

Pith reviewed 2026-06-28 10:29 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords multimodal document retrieval · Qwen2-VL · Multimodal-LLM embedders · visually-rich documents · retrieval challenge · MMDocIR · M2KR · training-free retrieval

The pith

Decoder-based Qwen2-VL embedders power the winning systems in the multimodal document retrieval challenge, with a training-free entry nearly matching the fine-tuned leader.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports results from a challenge that requires a single retrieval system to handle two tasks: closed-set page retrieval inside long documents from text queries (MMDocIR), and open-domain passage retrieval from image or image-plus-text queries (M2KR). Systems were ranked by the macro-average of mean Recall at 1, 3, and 5 across the two tasks. Analysis of the top three entries shows they all rely on decoder-based Multimodal-LLM embedders from the Qwen2-VL family instead of CLIP-style encoders. The training-free system scored within 0.1 points of the leading fine-tuned ensemble.

Core claim

All three winning teams built their systems on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders. The teams differed mainly in whether they reached the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within 0.1 point of the fine-tuned winner on the macro-averaged metric.
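To make "late interaction" concrete, here is a minimal sketch of ColBERT-style MaxSim scoring, the usual form of late interaction over multi-vector embeddings. It is an illustration under assumptions, not the third-place team's code: the function names, the toy two-dimensional vectors, and the page ids are all hypothetical, and a real system would obtain the vectors from a Qwen2-VL-based embedder.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: each query vector keeps only its best-matching
    document vector, and the per-query-vector maxima are summed."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_pages(query_vecs, pages):
    """Return page ids sorted by descending MaxSim score."""
    scores = {pid: maxsim_score(query_vecs, vecs) for pid, vecs in pages.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: two query token vectors, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.0, 0.5]],  # one vector per page patch (assumed)
    "page_b": [[0.5, 0.5]],
}

ranking = rank_pages(query, pages)
```

Because scoring needs only precomputed vectors and a max-then-sum, this kind of retriever can run zero-shot, with no task-specific training.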

What carries the argument

Decoder-based Multimodal-LLM embedders from the Qwen2-VL family, which generate the embeddings used for ranking in both retrieval regimes.

If this is right

  • Decoder-based embedders from the Qwen2-VL family outperform CLIP-style encoders on both closed-set document page retrieval and open-domain image-based passage retrieval.
  • Training-free methods can finish within 0.1 points of heavily fine-tuned ensembles on this benchmark.
  • Multi-route fusion combined with a vision-language re-ranker offers a competitive alternative to full fine-tuning.
  • A single system architecture can effectively address complementary retrieval regimes when evaluated on the combined macro-average metric.
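The multi-route fusion mentioned in the list above can be sketched as follows. The source does not say which fusion rule the team used, so this example swaps in reciprocal rank fusion (RRF), a common rank-based choice. The route names, document ids, and the constant `k=60` are assumptions for illustration, and the vision-language re-ranking stage is omitted.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each document earns 1 / (k + rank) from
    every route that returns it, and documents are sorted by total score."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Ties are broken by document id so the output is deterministic.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical outputs of three retrieval routes for one query.
text_route = ["d1", "d2", "d3"]
image_route = ["d2", "d3", "d1"]
ocr_route = ["d2", "d1", "d4"]

fused_ranking = reciprocal_rank_fusion([text_route, image_route, ocr_route])
```

In a pipeline of the kind the paper describes, the top few fused candidates would then be passed to a re-ranker that re-scores each query-candidate pair.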

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The pattern suggests decoder-only multimodal models may generalize better than contrastive encoders when documents contain interleaved figures, tables, and charts.
  • The near-parity of training-free systems could reduce the data and compute barriers for deploying multimodal retrievers in practice.
  • These findings point toward testing whether similar decoder embedders maintain their edge on larger collections or additional languages.
  • The results have direct bearing on retrieval-augmented generation pipelines that must surface visually rich content accurately.

Load-bearing premise

The macro-average of mean Recall at 1, 3 and 5 over the two tasks provides a fair and representative ranking of retrieval systems across the two regimes.
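A minimal sketch of how such a score could be computed is given below, to show what the premise weighs. It assumes Recall@k is the fraction of a query's relevant items found in the top k, averaged over queries; the challenge's exact definition may differ. The function names and toy rankings are hypothetical.

```python
from statistics import mean


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant item")
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


def task_score(runs, ks=(1, 3, 5)):
    """Average Recall@k over queries for each k, then average over k.
    `runs` is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return mean(mean(recall_at_k(r, rel, k) for r, rel in runs) for k in ks)


def challenge_score(runs_by_task, ks=(1, 3, 5)):
    """Macro-average: each task counts equally, whatever its query count."""
    return mean(task_score(runs, ks) for runs in runs_by_task.values())


# Hypothetical toy rankings: two MMDocIR queries and one M2KR query.
toy = {
    "MMDocIR": [(["p3", "p1", "p7"], ["p1"]), (["p2", "p9", "p4"], ["p2"])],
    "M2KR": [(["d5", "d8", "d1", "d2", "d6"], ["d6"])],
}

score = challenge_score(toy)  # (5/6 + 1/3) / 2 = 7/12
```

Because the two task scores are averaged with equal weight, a gain on the task with fewer queries moves the final score as much as the same gain on the larger task, which is the fairness question the premise raises.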

What would settle it

A retrieval system built on a CLIP-style encoder that scores higher than the Qwen2-VL winners on the same macro-averaged Recall metric across both tasks would falsify the observed pattern.

Figures

Figures reproduced from arXiv: 2606.04240 by Jingbiao Mei.

Figure 1: One unified model must serve two complemen…
Original abstract

Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The Multimodal Document Retrieval Challenge, Track 1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a single retrieval system that handles two complementary regimes: closed-set document page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average of mean Recall@{1,3,5} over the two tasks. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyses the three winning teams' systems. All three build on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders, and differ chiefly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within 0.1 point of the fine-tuned winner.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript is an overview report on the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1). It describes the challenge motivation for multimodal retrieval over visually-rich documents, the two tasks (closed-set MMDocIR page retrieval from text queries and open-domain M2KR passage retrieval from image or image+text queries), the ranking metric (macro-average of mean Recall@{1,3,5} across tasks), participation (455 entrants, 586 submissions, 22 teams), and the architectures and relative performance of the top three systems, all of which rely on Qwen2-VL decoder-based Multimodal-LLM embedders rather than CLIP-style encoders, with a training-free multi-route system finishing within 0.1 points of the fine-tuned winner.

Significance. If the reported participation numbers, standings, and system descriptions hold, the paper supplies a useful empirical snapshot of current practice in multimodal document retrieval. It documents the shift toward decoder-based MLLM embedders and the viability of training-free fusion approaches, which can inform subsequent work on retrieval-augmented generation over interleaved text, figures, and tables.

minor comments (3)
  1. [Abstract] The abstract and the section on final standings refer to a '0.1-point gap' without quoting the exact macro-average scores of the top two systems; adding these numbers would improve precision.
  2. A compact table listing the three winning teams, their key design choices (fine-tuning, fusion, late interaction), and per-task Recall values would make the comparative analysis easier to follow.
  3. [Challenge Design] The description of the M2KR task should explicitly note the size and source of the Wikipedia passage corpus used for open-domain retrieval to allow readers to assess scale.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading and positive assessment of our manuscript, including the recommendation for minor revision. The referee summary accurately reflects the challenge overview, participation statistics, and key findings regarding Qwen2-VL-based systems.

Circularity Check

0 steps flagged

No significant circularity


The paper is a descriptive challenge overview report with no equations, derivations, predictions, or first-principles claims. It reports external team submissions, architectures, and standings under fixed challenge rules and metrics without advancing any internal mathematical reduction, fitted parameter, or self-referential derivation. All content is observational reporting of independent external results, so no load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a challenge overview report rather than a derivation or empirical study, so the central claim rests on no free parameters, no domain axioms beyond standard information-retrieval evaluation practice, and no invented entities.


