pith. machine review for the scientific record.

arxiv: 2604.10167 · v1 · submitted 2026-04-11 · 💻 cs.CV · cs.CL · cs.IR

Recognition: unknown

Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval

James Kwok, Jiahao Huo, Mingdong Ou, Shuliang Liu, Xin Zou, Xuming Hu, Yibo Yan, Yi Cao

Pith reviewed 2026-05-10 15:21 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.IR
keywords visual document retrieval · late chunking · hierarchical clustering · multi-vector models · storage reduction · contextual chunking · patch embeddings · 2D position prior

The pith

ColChunk uses hierarchical clustering on image patch embeddings with a 2D position prior to create contextual chunks that cut storage by over 90 percent while raising retrieval accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ColChunk, a plug-and-play framework for visual document retrieval that applies multimodal late chunking to multi-vector models. It groups patch-level embeddings through hierarchical clustering guided by spatial position information, producing fewer but more coherent representations that retain global context. This directly addresses the high storage and compute costs that currently limit practical use of fine-grained matching in visual documents. A reader would care because the method delivers both major efficiency gains and better performance on standard benchmarks without changing the underlying models. The evaluations span 24 datasets and show consistent gains across the single-vector baselines.

Core claim

ColChunk introduces multimodal late chunking that performs hierarchical clustering on patch-level embeddings fused with a 2D position prior. This produces spatially and semantically coherent chunks that preserve global context while reducing the total number of vectors. The approach avoids fixed-token or pruning methods and instead adapts groupings to content. On 24 visual document retrieval datasets it yields over 90 percent storage reduction together with an average 9-point gain in nDCG@5 for representative single-vector models.
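The claimed pipeline — fuse a 2D position prior into patch embeddings, cluster hierarchically, pool each cluster into one chunk vector — can be sketched minimally as below. Since the paper's exact formulation is not reproduced here, Ward-linkage clustering, concatenation fusion, mean pooling, and the `beta` and `n_chunks` knobs are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def colchunk_pool(patch_emb, grid_hw, n_chunks=16, beta=0.5):
    """Sketch of position-aware late chunking: fuse a normalized 2D
    position prior into each patch embedding, cluster hierarchically,
    and mean-pool each cluster into a single chunk vector."""
    n, d = patch_emb.shape
    h, w = grid_hw
    assert n == h * w, "expected one embedding per patch in an h x w grid"
    ys, xs = np.divmod(np.arange(n), w)               # row-major patch grid
    pos = np.stack([ys / h, xs / w], axis=1)          # normalized 2D coordinates
    fused = np.concatenate([patch_emb, beta * pos], axis=1)
    tree = linkage(fused, method="ward")              # agglomerative cluster tree
    labels = fcluster(tree, t=n_chunks, criterion="maxclust")  # cut into <= K chunks
    return np.stack([patch_emb[labels == k].mean(axis=0)
                     for k in np.unique(labels)])
```

Note that the number of patches assigned to each chunk is adaptive: the tree cut fixes only the chunk count, so content-dense regions can claim larger or smaller groups as the embeddings dictate.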

What carries the argument

ColChunk framework performing hierarchical clustering on patch-level embeddings fused with a 2D position prior to generate adaptive, content-aware chunks.

Load-bearing premise

Hierarchical clustering on patch-level embeddings fused with a 2D position prior will reliably produce spatially and semantically coherent chunks that preserve global context without losing critical retrieval information.

What would settle it

Testing ColChunk on a collection of images whose patch embeddings lack semantic structure (e.g., random noise patterns) and finding that retrieval scores drop or stay flat instead of improving would show that the clustering step fails to preserve the information retrieval needs.
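The probe described above can be sketched directly: on unstructured embeddings, pooling patches into chunks should average away the detail a query matches on, so the best-match similarity under late interaction should fall rather than hold. A random assignment stands in for the clustering output here (the real test would use ColChunk's own labels), and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
patches = rng.normal(size=(256, 64))            # pure-noise "patch embeddings"
labels = rng.integers(0, 16, size=256)          # stand-in for clustering output
chunks = np.stack([patches[labels == k].mean(axis=0)
                   for k in np.unique(labels)])  # mean-pooled chunk vectors

q = patches[42] / np.linalg.norm(patches[42])   # query aligned with one patch
best_patch = (patches @ q).max()                # fine-grained best match
best_chunk = (chunks @ q).max()                 # chunked best match
# With no semantic structure, pooling dilutes the matching patch:
assert best_chunk < best_patch
```

A flat or improved score on such inputs, per the review's framing, would instead suggest the gains come from somewhere other than coherent grouping.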

Figures

Figures reproduced from arXiv: 2604.10167 by James Kwok, Jiahao Huo, Mingdong Ou, Shuliang Liu, Xin Zou, Xuming Hu, Yibo Yan, Yi Cao.

Figure 2
Figure 2: The comparison of the average performance of ColChunk across different chunk sizes. The dashed lines refer to the base results. view at source ↗
Figure 1
Figure 1: The performance comparison of ColChunk on five VDR benchmarks across five single-vector retrieval models. view at source ↗
Figure 3
Figure 3: The comparison of the overall performance of ColChunk: (Left) with different weighting factors (different colors indicate different chunk sizes, dashed lines refer to average base results, and stars refer to the best results); (Middle) with different chunking positions (after vision encoder vs. after LLM backbone) across five models; (Right) with different clustering methods (k-means vs. hierarchical) acro… view at source ↗
read the original abstract

Multi-vector models dominate Visual Document Retrieval (VDR) due to their fine-grained matching capabilities, but their high storage and computational costs present a major barrier to practical deployment. In this paper, we propose ColChunk, a plug-and-play framework that introduces multimodal late chunking to construct efficient, contextualized multi-vectors. Unlike existing pruning or fixed-token approaches, ColChunk employs hierarchical clustering on patch-level embeddings, fused with a 2D position prior to ensure spatial-semantic coherence. This adaptive grouping allows for a content-aware representation that preserves global context while drastically reducing the vector count. Evaluations across 24 VDR datasets demonstrate ColChunk achieves over a 90% reduction in storage requirements while simultaneously delivering a 9-point average improvement in nDCG@5 across representative single-vector models. ColChunk provides a practical solution for balancing retrieval accuracy and efficiency in visual document systems.
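The fine-grained matching the abstract refers to is ColBERT-style late interaction, which chunking leaves untouched while shrinking the document-side index. A sketch, with illustrative vector counts (the per-page patch and chunk counts below are assumptions, not the paper's numbers):

```python
import numpy as np

def maxsim(query_vecs, doc_vecs):
    """ColBERT-style late interaction: each query vector is matched to its
    most similar document-side vector, and the per-query maxima are summed."""
    return (query_vecs @ doc_vecs.T).max(axis=1).sum()

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 128))               # query token embeddings
page_patches = rng.normal(size=(1024, 128))  # one vector per image patch
page_chunks = rng.normal(size=(64, 128))     # after chunking: far fewer vectors

score = maxsim(q, page_chunks)               # scoring machinery is unchanged
# The index stores only document-side vectors, so storage tracks vector count:
reduction = 1 - page_chunks.shape[0] / page_patches.shape[0]  # 0.9375 here
```

Under these assumed counts the per-page index shrinks by about 94%, which is the mechanism behind the abstract's >90% storage-reduction figure; the accuracy claim is that the pooled chunk vectors lose little of what MaxSim needs.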

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces ColChunk, a plug-and-play multimodal late chunking framework for visual document retrieval. It uses hierarchical clustering on patch-level embeddings fused with a 2D position prior to produce contextualized multi-vectors, claiming over 90% storage reduction and a 9-point average nDCG@5 improvement across 24 VDR datasets relative to single-vector models.

Significance. If the empirical results prove robust, the work would be significant for practical VDR deployment by reducing the storage burden of multi-vector models while reporting accuracy gains. The scale of evaluation across 24 datasets is a strength, but the absence of detailed controls limits the ability to assess generalizability.

major comments (3)
  1. [Abstract] The central claims of >90% storage reduction and a 9-point nDCG@5 gain are stated without reference to the specific single-vector baselines, per-dataset breakdowns, statistical significance tests, or variance measures required to substantiate the average improvement and efficiency results.
  2. [Method] The fusion of the 2D position prior into hierarchical clustering on patch embeddings is presented as ensuring spatial-semantic coherence, yet no ablation removing the 2D prior is reported, leaving the contribution of this component unisolated and the preservation of global context (e.g., table structures or cross-column references) unverified.
  3. [Experiments] No comparisons against simpler fixed-size or purely semantic chunking baselines are included to demonstrate that the proposed hierarchical clustering with position prior is necessary for the reported gains rather than achievable by less complex alternatives.
minor comments (1)
  1. The abstract and method would benefit from a brief illustrative example or diagram showing chunk formation on a multi-column or tabular document to clarify how global context is preserved.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify how to better substantiate our claims. We address each major comment below and will incorporate revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central claims of >90% storage reduction and a 9-point nDCG@5 gain are stated without reference to the specific single-vector baselines, per-dataset breakdowns, statistical significance tests, or variance measures required to substantiate the average improvement and efficiency results.

    Authors: We agree that the abstract would be strengthened by greater specificity. In the revised version, we will update the abstract to explicitly name the single-vector baselines (e.g., the representative models evaluated), state that the 9-point nDCG@5 figure is the average across the 24 datasets, and direct readers to the Experiments section for per-dataset breakdowns, statistical significance tests, and variance measures. Abstract length constraints will be respected while adding these key qualifiers. revision: yes

  2. Referee: [Method] The fusion of the 2D position prior into hierarchical clustering on patch embeddings is presented as ensuring spatial-semantic coherence, yet no ablation removing the 2D prior is reported, leaving the contribution of this component unisolated and the preservation of global context (e.g., table structures or cross-column references) unverified.

    Authors: The 2D position prior is fused with patch embeddings to enforce spatial-semantic coherence during hierarchical clustering, which is intended to help maintain document-level structures. We acknowledge that the original submission did not include an ablation isolating this component. We will add such an ablation (with/without the 2D prior) to the revised Experiments section and include qualitative examples illustrating preservation of structures such as tables and cross-column references. revision: yes

  3. Referee: [Experiments] No comparisons against simpler fixed-size or purely semantic chunking baselines are included to demonstrate that the proposed hierarchical clustering with position prior is necessary for the reported gains rather than achievable by less complex alternatives.

    Authors: The core evaluation compares ColChunk against single-vector models to highlight storage reduction while retaining multi-vector accuracy. We recognize that direct comparisons to simpler chunking strategies would better isolate the value of hierarchical clustering plus the 2D prior. We will therefore add fixed-size chunking and purely semantic clustering baselines to the revised Experiments section. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical plug-and-play method with no derivation chain

full rationale

The paper describes ColChunk as an empirical framework that applies hierarchical clustering to patch-level embeddings fused with a 2D position prior, then reports storage and nDCG@5 results from evaluations on 24 datasets. No equations, first-principles derivations, or predictions appear in the provided text. Claims rest on external experimental outcomes rather than any self-definitional constructs, fitted inputs renamed as predictions, or load-bearing self-citations. The work stands as a self-contained practical method whose reported gains are not, by construction, reducible to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view shows no explicit free parameters, axioms, or invented entities; the method assumes standard hierarchical clustering behaves well when augmented with spatial priors, but details are absent.

pith-pipeline@v0.9.0 · 5474 in / 1029 out tokens · 43472 ms · 2026-05-10T15:21:00.677035+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

48 extracted references · 30 canonical work pages · 5 internal anchors

  1. [1]

    Sungguk Cha, DongWook Kim, Mintae Kim, Youngsub Han, Byoung-Ki Jeon, and Sangyeob Lee. 2026. ReinPool: Reinforcement Learning Pooling Multi-Vector Embeddings for Retrieval System. arXiv preprint arXiv:2601.07125 (2026)

  2. [2]

    Benjamin Clavié, Antoine Chaffin, and Griffin Adams. 2024. Reducing the footprint of multi-vector retrieval with minimal performance impact via token pooling. arXiv preprint arXiv:2409.14683 (2024)

  3. [3]

    Max Conti, Manuel Faysse, Gautier Viaud, Antoine Bosselut, Céline Hudelot, and Pierre Colombo. 2025. Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings. arXiv preprint arXiv:2505.24782 (2025)

  4. [4]

    Yihao Ding, Siwen Luo, Yue Dai, Yanbei Jiang, Zechuan Li, Geoffrey Martin, and Yifan Peng. 2025. A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends. arXiv preprint arXiv:2507.09861 (2025)

  5. [5]

    Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. 2024. ColPali: Efficient document retrieval with vision language models. arXiv preprint arXiv:2407.01449 (2024)

  6. [6]

    Sensen Gao, Shanshan Zhao, Xu Jiang, Lunhao Duan, Yong Xien Chng, Qing-Guo Chen, Weihua Luo, Kaifu Zhang, Jia-Wang Bian, and Mingming Gong. 2025. Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding. arXiv preprint arXiv:2510.15253 (2025)

  7. [7]

    Tiancheng Gu, Kaicheng Yang, Kaichen Zhang, Xiang An, Ziyong Feng, Yueyi Zhang, Weidong Cai, Jiankang Deng, and Lidong Bing. 2025. UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning. arXiv preprint arXiv:2510.13515 (2025)

  8. [8]

    Michael Günther, Isabelle Mohr, Daniel James Williams, Bo Wang, and Han Xiao. 2024. Late chunking: contextual chunk embeddings using long-context embedding models. arXiv preprint arXiv:2409.04701 (2024)

  9. [9]

    Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, et al. 2025. jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval. arXiv preprint arXiv:2506.18902 (2025)

  10. [10]

    Shanxiu He, Mutasem Al-Darabsah, Suraj Nair, Jonathan May, Tarun Agarwal, Tao Yang, and Choon Hui Teo. 2025. Token pruning optimization for efficient multi-vector dense retrieval. In European Conference on Information Retrieval. Springer, 101–115

  11. [11]

    Jiahao Huo, Yu Huang, Yibo Yan, Ye Pan, Yi Cao, Mingdong Ou, Philip S Yu, and Xuming Hu. 2026. CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding. arXiv preprint arXiv:2601.21262 (2026)

  12. [12]

    Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. 2024. VLM2Vec: Training vision-language models for massive multimodal embedding tasks. arXiv preprint arXiv:2410.05160 (2024)

  13. [13]

    Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 39–48

  14. [14]

    Carlos Lassance, Maroua Maachou, Joohee Park, and Stéphane Clinchant. 2022. Learned token pruning in contextualized late interaction over BERT (ColBERT). In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2232–2236

  15. [15]

    Jinhyuk Lee, Zhuyun Dai, Sai Meher Karthik Duddu, Tao Lei, Iftekhar Naim, Ming-Wei Chang, and Vincent Zhao. 2023. Rethinking the role of token retrieval in multi-vector retrieval. Advances in Neural Information Processing Systems 36 (2023), 15384–15405

  16. [16]

    Qi Liu and Jiaxin Mao. 2023. Understanding the Multi-vector Dense Retrieval Models. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 4110–4114

  17. [17]

    Yikun Liu, Yajie Zhang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, and Weidi Xie. 2025. LamRA: Large multimodal model as your advanced retrieval assistant. In Proceedings of the Computer Vision and Pattern Recognition Conference. 4015–4025

  18. [18]

    Zheng Liu, Ze Liu, Zhengyang Liang, Junjie Zhou, Shitao Xiao, Chao Gao, Chen Jason Zhang, and Defu Lian. 2025. Any information is just worth one single screenshot: Unifying search with visualized information retrieval. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 19238–19261

  19. [19–20]

    Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, and Jimmy Lin. 2024. Unifying Multimodal Retrieval via Document Screenshot Embedding. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 6492–6505. doi:10.18653/v1/2024.emnlp-main.373

  21. [21]

    Yubo Ma, Jinsong Li, Yuhang Zang, Xiaobao Wu, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Jiaqi Wang, Yixin Cao, et al. 2025. Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings. arXiv preprint arXiv:2506.04997 (2025)

  22. [22]

    Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, et al. 2024. MMLongBench-Doc: Benchmarking long-context document understanding with visualizations. Advances in Neural Information Processing Systems 37 (2024), 95963–96010

  23. [23]

    Sean MacAvaney, Antonio Mallia, and Nicola Tonellotto. 2025. Efficient Constant-Space Multi-vector Retrieval. In European Conference on Information Retrieval. Springer, 237–245

  24. [24]

    Quentin Macé, António Loison, and Manuel Faysse. 2025. ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval. arXiv preprint arXiv:2505.17166 (2025)

  25. [25]

    Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, et al. 2025. VLM2Vec-V2: Advancing multimodal embedding for videos, images, and visual documents. arXiv preprint arXiv:2507.04590 (2025)

  26. [26]

    Carlo Merola and Jaspinder Singh. 2025. Reconstructing context: Evaluating advanced chunking strategies for retrieval-augmented generation. In International Workshop on Knowledge-Enhanced Information Retrieval. Springer, 3–18

  27. [27]

    Gabriel de Souza P Moreira, Ronay Ak, Mengyao Xu, Oliver Holworthy, Benedikt Schifferer, Zhiding Yu, Yauhen Babakhin, Radek Osmulski, Jiarui Cai, Ryan Chesler, et al. 2026. Nemotron ColEmbed V2: Top-Performing Late Interaction embedding models for Visual Document Retrieval. arXiv preprint arXiv:2602.03992 (2026)

  28. [28]

    Alexander Most, Joseph Winjum, Manish Bhattarai, Shawn Jones, Nishath Rajiv Ranasinghe, Ayan Biswas, and Dan O'Malley. 2025. Lost in OCR translation? Vision-based approaches to robust document retrieval. In Proceedings of the 2025 ACM Symposium on Document Engineering. 1–10

  29. [29]

    Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, et al. 2025. MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing. arXiv preprint arXiv:2509.22186 (2025)

  30. [30]

    Cheoneum Park, Seohyeong Jeong, Minsang Kim, KyungTae Lim, and Yong-Hun Lee. 2025. SCV: Light and Effective Multi-Vector Retrieval with Sequence Compressive Vectors. In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track. 760–770

  31. [31]

    Yujie Qian, Jinhyuk Lee, Sai Meher Karthik Duddu, Zhuyun Dai, Siddhartha Brahma, Iftekhar Naim, Tao Lei, and Vincent Y Zhao. 2022. Multi-vector retrieval as sparse alignment. arXiv preprint arXiv:2211.01267 (2022)

  32. [32]

    Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 3715–3734

  33. [33]

    Jan Luca Scheerer, Matei Zaharia, Christopher Potts, Gustavo Alonso, and Omar Khattab. 2025. WARP: An efficient engine for multi-vector retrieval. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2504–2512

  34. [34]

    Susav Shrestha, Narasimha Reddy, and Zongwang Li. 2024. ESPN: Memory-efficient multi-vector information retrieval. In Proceedings of the 2024 ACM SIGPLAN International Symposium on Memory Management. 95–107

  35. [35]

    Paul Teiletche, Quentin Macé, Max Conti, Antonio Loison, Gautier Viaud, Pierre Colombo, and Manuel Faysse. 2025. ModernVBERT: Towards Smaller Visual Document Retrievers. arXiv preprint arXiv:2510.01149 (2025)

  36. [36]

    Naftali Tishby, Fernando C Pereira, and William Bialek. 2000. The information bottleneck method. arXiv preprint physics/0004057 (2000)

  37. [37]

    João Veneroso, Rajesh Jayaram, Jinmeng Rao, Gustavo Hernández Ábrego, Majid Hadian, and Daniel Cer. 2025. CRISP: Clustering Multi-Vector Representations for Denoising and Pruning. arXiv preprint arXiv:2505.11471 (2025)

  38. [38]

    Feng Wang, Yuqing Li, and Han Xiao. 2025. Jina-reranker-v3: Last but Not Late Interaction for Listwise Document Reranking. arXiv preprint arXiv:2509.25085 (2025)

  39. [39]

    Qiuchen Wang, Ruixue Ding, Zehui Chen, Weiqi Wu, Shihang Wang, Pengjun Xie, and Feng Zhao. 2025. ViDoRAG: Visual document retrieval-augmented generation via dynamic iterative reasoning agents. arXiv preprint arXiv:2502.18017 (2025)

  40. [40]

    Zilin Xiao, Qi Ma, Mengting Gu, Chun-cheng Jason Chen, Xintao Chen, Vicente Ordonez, and Vijai Mohan. 2025. MetaEmbed: Scaling multimodal retrieval at test-time with flexible late interaction. arXiv preprint arXiv:2509.18095 (2025)

  41. [41]

    Yibo Yan, Jiahao Huo, Guanbo Feng, Mingdong Ou, Yi Cao, Xin Zou, Shuliang Liu, Yuanhuiyi Lyu, Yu Huang, Jungang Li, et al. 2026. Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval. arXiv preprint arXiv:2602.19961 (2026)

  42. [42]

    Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Jiahao Huo, Shuliang Liu, James Kwok, and Xuming Hu. 2026. Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework. arXiv preprint arXiv:2602.19549 (2026)

  43. [43]

    Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Shuliang Liu, Jiahao Huo, Yu Huang, James Kwok, and Xuming Hu. 2026. Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations. arXiv preprint arXiv:2603.01666 (2026)

  44. [44–45]

    Yibo Yan, Guangwei Xu, Xin Zou, Shuliang Liu, James Kwok, and Xuming Hu. 2025. DocPruner: A storage-efficient framework for multi-vector visual document retrieval via adaptive patch-level embedding pruning. arXiv preprint arXiv:2509.23883 (2025)

  46. [46]

    Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, et al. 2024. VisRAG: Vision-based retrieval-augmented generation on multi-modality documents. arXiv preprint arXiv:2410.10594 (2024)

  47. [47]

    Junyuan Zhang, Qintong Zhang, Bin Wang, Linke Ouyang, Zichen Wen, Ying Li, Ka-Ho Chow, Conghui He, and Wentao Zhang. 2025. OCR hinders RAG: Evaluating the cascading impact of OCR on retrieval-augmented generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 17443–17453

  48. [48]

    Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. 2024. GME: Improving Universal Multimodal Retrieval by Multimodal LLMs. arXiv preprint arXiv:2412.16855 (2024)