Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval
Pith reviewed 2026-05-10 15:21 UTC · model grok-4.3
The pith
ColChunk uses hierarchical clustering on image patch embeddings with a 2D position prior to create contextual chunks that cut storage by over 90 percent while raising retrieval accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ColChunk introduces multimodal late chunking that performs hierarchical clustering on patch-level embeddings fused with a 2D position prior. This produces spatially and semantically coherent chunks that preserve global context while reducing the total number of vectors. The approach avoids fixed-token or pruning methods and instead adapts groupings to content. On 24 visual document retrieval datasets it yields over 90 percent storage reduction together with an average 9-point gain in nDCG@5 for representative single-vector models.
What carries the argument
The ColChunk framework, which performs hierarchical clustering on patch-level embeddings fused with a 2D position prior to generate adaptive, content-aware chunks.
Load-bearing premise
Hierarchical clustering on patch-level embeddings fused with a 2D position prior will reliably produce spatially and semantically coherent chunks that preserve global context without losing critical retrieval information.
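The review text does not specify how the fusion or linkage works; as one plausible reading, the premise can be sketched as agglomerative (Ward) clustering over patch embeddings concatenated with weighted, normalized 2D patch coordinates. The `pos_weight` and `n_chunks` parameters and the mean-pooling step are illustrative assumptions, not the paper's specification:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def chunk_patches(patch_emb, grid_h, grid_w, pos_weight=0.5, n_chunks=32):
    """Group patch embeddings into spatially coherent chunks.

    patch_emb : (grid_h * grid_w, d) array of patch-level embeddings.
    Returns (chunk_vectors, labels): one mean-pooled vector per chunk.
    """
    n, d = patch_emb.shape
    assert n == grid_h * grid_w
    # L2-normalize the semantic part so the position term is comparable.
    emb = patch_emb / (np.linalg.norm(patch_emb, axis=1, keepdims=True) + 1e-8)
    # 2D position prior: normalized (row, col) coordinates per patch.
    rows, cols = np.divmod(np.arange(n), grid_w)
    pos = np.stack([rows / max(grid_h - 1, 1), cols / max(grid_w - 1, 1)], axis=1)
    # Fuse semantics and position; pos_weight trades spatial coherence vs. content.
    feats = np.concatenate([emb, pos_weight * pos], axis=1)
    labels = fcluster(linkage(feats, method="ward"), t=n_chunks, criterion="maxclust")
    # One contextualized vector per chunk: mean-pool its member patches.
    chunk_vectors = np.stack([patch_emb[labels == c].mean(axis=0)
                              for c in np.unique(labels)])
    return chunk_vectors, labels
```

Under this reading, reducing a 32×32 patch grid (1024 vectors) to 32 chunk vectors cuts the vector count by about 97 percent, consistent in scale with the paper's >90 percent storage figure.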
What would settle it
Testing ColChunk on a collection of images whose patch embeddings lack semantic structure, such as random noise patterns, and finding that retrieval scores drop or stay flat instead of improving would show the clustering step fails to maintain necessary information.
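This test can be mocked up with synthetic patch embeddings. The cohesion metric below (pooled within-chunk variance over total variance, lower meaning tighter chunks) is a stand-in of our choosing, not from the paper; it stays high when the patches are pure noise and drops sharply when real region structure exists:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def chunk_cohesion(X, n_chunks=4):
    """Pooled within-chunk variance over total variance (lower = tighter chunks)."""
    labels = fcluster(linkage(X, method="ward"), t=n_chunks, criterion="maxclust")
    total = X.var(axis=0).sum()
    within = sum(((labels == c).mean() * X[labels == c].var(axis=0)).sum()
                 for c in np.unique(labels))
    return within / total

rng = np.random.default_rng(0)
noise = rng.normal(size=(64, 16))                        # no semantic structure
centers = rng.normal(scale=5.0, size=(4, 16))            # four distinct "regions"
structured = np.concatenate([c + rng.normal(size=(16, 16)) for c in centers])

print(f"noise:      {chunk_cohesion(noise):.2f}")        # stays high
print(f"structured: {chunk_cohesion(structured):.2f}")   # drops sharply
```

If retrieval quality on such noise collections failed to degrade gracefully, the clustering step could not be said to preserve the information that matters.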
Original abstract
Multi-vector models dominate Visual Document Retrieval (VDR) due to their fine-grained matching capabilities, but their high storage and computational costs present a major barrier to practical deployment. In this paper, we propose ColChunk, a plug-and-play framework that introduces multimodal late chunking to construct efficient, contextualized multi-vectors. Unlike existing pruning or fixed-token approaches, ColChunk employs hierarchical clustering on patch-level embeddings, fused with a 2D position prior to ensure spatial-semantic coherence. This adaptive grouping allows for a content-aware representation that preserves global context while drastically reducing the vector count. Evaluations across 24 VDR datasets demonstrate ColChunk achieves over a 90% reduction in storage requirements while simultaneously delivering a 9-point average improvement in nDCG@5 across representative single-vector models. ColChunk provides a practical solution for balancing retrieval accuracy and efficiency in visual document systems.
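For reference, the nDCG@5 metric behind the headline numbers can be computed as below, using binary relevance with linear gain (a common convention; the paper may use a different gain function):

```python
import numpy as np

def ndcg_at_k(relevances, k=5):
    """nDCG@k: discounted cumulative gain of the ranked list,
    normalized by the ideal (best possible) ordering."""
    rel = np.asarray(relevances, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))       # 1/log2(rank + 1)
    top = rel[:k]
    dcg = (top * discounts[:top.size]).sum()
    ideal = np.sort(rel)[::-1][:k]
    idcg = (ideal * discounts[:ideal.size]).sum()
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([1, 0, 0, 0, 0]))  # relevant doc ranked first -> 1.0
print(ndcg_at_k([0, 1, 0, 0, 0]))  # relevant doc ranked second -> ~0.63
```

A 9-point average gain means this score, averaged over queries and the 24 datasets, rises by 0.09 on the usual ×100 reporting scale.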
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ColChunk, a plug-and-play multimodal late chunking framework for visual document retrieval. It uses hierarchical clustering on patch-level embeddings fused with a 2D position prior to produce contextualized multi-vectors, claiming over 90% storage reduction and a 9-point average nDCG@5 improvement across 24 VDR datasets relative to single-vector models.
Significance. If the empirical results prove robust, the work would be significant for practical VDR deployment by reducing the storage burden of multi-vector models while reporting accuracy gains. The scale of evaluation across 24 datasets is a strength, but the absence of detailed controls limits the ability to assess generalizability.
Major comments (3)
- [Abstract] The central claims of >90% storage reduction and a 9-point nDCG@5 gain are stated without reference to the specific single-vector baselines, per-dataset breakdowns, statistical significance tests, or variance measures required to substantiate the average improvement and efficiency results.
- [Method] The fusion of the 2D position prior into hierarchical clustering on patch embeddings is presented as ensuring spatial-semantic coherence, yet no ablation removing the 2D prior is reported, leaving the contribution of this component unisolated and the preservation of global context (e.g., table structures or cross-column references) unverified.
- [Experiments] No comparisons against simpler fixed-size or purely semantic chunking baselines are included to show that hierarchical clustering with the position prior is necessary for the reported gains rather than achievable by less complex alternatives.
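The simpler baseline the referee asks for is easy to state: mean-pool patches in fixed spatial windows, ignoring content entirely. The block size here is an arbitrary illustrative choice:

```python
import numpy as np

def fixed_grid_chunks(patch_emb, grid_h, grid_w, block=4):
    """Content-blind baseline: mean-pool patch embeddings in fixed
    block x block windows, regardless of what the patches contain."""
    emb = patch_emb.reshape(grid_h, grid_w, -1)
    chunks = [emb[y:y + block, x:x + block].reshape(-1, emb.shape[-1]).mean(axis=0)
              for y in range(0, grid_h, block)
              for x in range(0, grid_w, block)]
    return np.stack(chunks)
```

At matched chunk counts, any gain of content-aware clustering over this baseline would isolate the value of the adaptive grouping; without that comparison, the necessity of the more complex machinery remains open.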
Minor comments (1)
- The abstract and method would benefit from a brief illustrative example or diagram showing chunk formation on a multi-column or tabular document to clarify how global context is preserved.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify how to better substantiate our claims. We address each major comment below and will incorporate revisions to improve the manuscript.
Point-by-point responses
Referee: [Abstract] The central claims of >90% storage reduction and a 9-point nDCG@5 gain are stated without reference to the specific single-vector baselines, per-dataset breakdowns, statistical significance tests, or variance measures required to substantiate the average improvement and efficiency results.
Authors: We agree that the abstract would be strengthened by greater specificity. In the revised version, we will update the abstract to explicitly name the single-vector baselines (e.g., the representative models evaluated), state that the 9-point nDCG@5 figure is the average across the 24 datasets, and direct readers to the Experiments section for per-dataset breakdowns, statistical significance tests, and variance measures. Abstract length constraints will be respected while adding these key qualifiers. revision: yes
Referee: [Method] The fusion of the 2D position prior into hierarchical clustering on patch embeddings is presented as ensuring spatial-semantic coherence, yet no ablation removing the 2D prior is reported, leaving the contribution of this component unisolated and the preservation of global context (e.g., table structures or cross-column references) unverified.
Authors: The 2D position prior is fused with patch embeddings to enforce spatial-semantic coherence during hierarchical clustering, which is intended to help maintain document-level structures. We acknowledge that the original submission did not include an ablation isolating this component. We will add such an ablation (with/without the 2D prior) to the revised Experiments section and include qualitative examples illustrating preservation of structures such as tables and cross-column references. revision: yes
Referee: [Experiments] No comparisons against simpler fixed-size or purely semantic chunking baselines are included to show that hierarchical clustering with the position prior is necessary for the reported gains rather than achievable by less complex alternatives.
Authors: The core evaluation compares ColChunk against single-vector models to highlight storage reduction while retaining multi-vector accuracy. We recognize that direct comparisons to simpler chunking strategies would better isolate the value of hierarchical clustering plus the 2D prior. We will therefore add fixed-size chunking and purely semantic clustering baselines to the revised Experiments section. revision: yes
Circularity Check
No circularity: purely empirical plug-and-play method with no derivation chain
Full rationale
The paper describes ColChunk as an empirical framework that applies hierarchical clustering to patch-level embeddings fused with a 2D position prior, then reports storage and nDCG@5 results from evaluations on 24 datasets. No equations, first-principles derivations, or predictions appear in the provided text. Claims rest on external experimental outcomes rather than any self-definitional constructs, fitted inputs renamed as predictions, or load-bearing self-citations. The work is self-contained as a practical method without reducing its reported gains to its own inputs by construction.