Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization

Ariel Gera; Asaf Yehudai; Eyal Shnarch; Omri Uzan; Roi Pony

arxiv: 2510.05038 · v3 · submitted 2025-10-06 · 💻 cs.CL

Omri Uzan, Asaf Yehudai, Roi Pony, Eyal Shnarch, Ariel Gera

Pith reviewed 2026-05-18 09:38 UTC · model grok-4.3

classification 💻 cs.CL

keywords multimodal retrieval · hybrid retrieval · test-time optimization · query embedding refinement · visual document retrieval · vision-language models · retrieval efficiency

The pith

Guided Query Refinement refines a vision-centric model's query embedding at test time using scores from a lightweight text retriever to match the accuracy of much larger models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Guided Query Refinement as a test-time method that adjusts the query embedding of a primary vision-centric retriever by drawing on ranking signals from a simpler dense text retriever. This targets the scaling problems of large multimodal representations in visual document retrieval, where current models demand heavy compute and memory. A sympathetic reader would care because the approach promises to close performance gaps without increasing model size, training, or representation dimensions. Experiments across benchmarks show the refined systems reaching parity with larger models while delivering major gains in speed and memory use.

Core claim

Guided Query Refinement is a test-time optimization procedure that refines the query embedding of a primary vision-centric retriever by leveraging guidance signals derived from the ranking scores of a complementary lightweight dense text retriever. This hybrid approach exploits rich interactions within each model's representation space rather than relying on coarse-grained fusion of ranks or scores. The result is that vision-centric models reach performance levels comparable to those relying on significantly larger representations.
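For contrast with the embedding-level refinement claimed here, the two coarse-grained fusion families can be sketched as follows. This is a minimal illustration, not the paper's baseline code: reciprocal rank fusion follows its standard formula with the conventional constant k = 60, while the min-max normalization and the weight `alpha` in the score-level variant are assumptions, as are all function names and the example scores.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Rank-level fusion: each ranking is a list of doc ids, best first.

    A document's fused score is the sum over rankings of 1 / (k + rank),
    with ranks starting at 1.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1] (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {d: (s - lo) / span if span > 0 else 0.0 for d, s in scores.items()}


def weighted_score_fusion(primary, complementary, alpha=0.5):
    """Score-level fusion: convex combination of normalized scores.

    A document missing from one retriever's pool contributes 0 for it.
    """
    p, c = min_max(primary), min_max(complementary)
    docs = set(p) | set(c)
    fused = {d: alpha * p.get(d, 0.0) + (1 - alpha) * c.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical candidate pools from a vision-centric and a text retriever.
vision_scores = {"page_a": 0.92, "page_b": 0.85, "page_c": 0.40}
text_scores = {"page_b": 14.0, "page_d": 11.0, "page_a": 3.0}

vision_rank = sorted(vision_scores, key=vision_scores.get, reverse=True)
text_rank = sorted(text_scores, key=text_scores.get, reverse=True)

rrf_order = reciprocal_rank_fusion([vision_rank, text_rank])
score_order = weighted_score_fusion(vision_scores, text_scores, alpha=0.5)
```

Both functions see only the final ranks or scores of each retriever, which is the sense in which such fusion is coarse-grained: neither touches the query or document embeddings.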

What carries the argument

Guided Query Refinement (GQR), a test-time optimization that adjusts the primary query embedding using scores from a complementary retriever to improve hybrid retrieval without per-query hyperparameter search.
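The refinement loop can be sketched in a few lines. What follows is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single-vector primary retriever with dot-product scores, forms the consensus as a plain average of the two softmax distributions, treats that consensus as a constant within each step, and uses plain gradient descent with a hypothetical learning rate. Only the overall shape (a pooled candidate set, T iterations, and a KL divergence between a consensus distribution and the primary distribution) comes from the paper's Figure 2 caption; every function and variable name is invented for the example.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same candidates."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def guided_query_refinement(query, doc_embs, comp_scores, steps=25, lr=0.1):
    """Refine `query` so the primary distribution moves toward a consensus.

    query:       primary query embedding (list of floats)
    doc_embs:    primary-space embeddings of the pooled candidates
    comp_scores: complementary retriever's scores for the same candidates
    """
    p_comp = softmax(comp_scores)
    z = list(query)
    for _ in range(steps):
        p_prim = softmax([dot(z, d) for d in doc_embs])
        # Assumed consensus: average of the two distributions, held fixed
        # (no gradient flows through it) within the step.
        consensus = [0.5 * (a + b) for a, b in zip(p_prim, p_comp)]
        # Gradient of KL(consensus || p_prim) w.r.t. z: sum_i (p_i - c_i) * d_i
        grad = [
            sum((p - c) * d[j] for p, c, d in zip(p_prim, consensus, doc_embs))
            for j in range(len(z))
        ]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z


# Hypothetical pool of 4 candidates in a 3-dimensional primary space.
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
query = [1.0, 0.2, 0.0]             # primary retriever prefers candidate 0
text_scores = [0.1, 2.0, 0.0, 0.5]  # text retriever prefers candidate 1

before = softmax([dot(query, d) for d in docs])
refined = guided_query_refinement(query, docs, text_scores, steps=25, lr=0.5)
after = softmax([dot(refined, d) for d in docs])
```

Under these particular assumptions each step is a descent step on the KL divergence from the complementary distribution to the primary one, at half the nominal step size, so the number of iterations and the learning rate together control how far the query drifts from the primary retriever's original view.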

If this is right

Vision-centric models achieve performance comparable to models with significantly larger representations on visual document retrieval benchmarks.
Retrieval runs up to 14x faster and uses 54x less memory than the larger-representation alternatives.
The Pareto frontier for performance versus efficiency advances in multimodal retrieval systems.
Hybrid retrieval benefits from embedding-level refinement instead of post-hoc rank or score fusion.
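The Pareto claim in this list has a precise meaning that a short sketch makes concrete: a system is on the frontier if no other system is at least as good on both latency and quality and strictly better on one. The function name is an assumption, and the numbers below are made up for illustration; they are not results from the paper.

```python
def pareto_frontier(systems):
    """Return the names of systems not dominated on (latency, quality).

    systems: list of (name, latency_ms, quality); lower latency and higher
    quality are both preferred.
    """
    frontier = []
    for name, lat, q in systems:
        dominated = any(
            (o_lat <= lat and o_q >= q) and (o_lat < lat or o_q > q)
            for _, o_lat, o_q in systems
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical numbers, for illustration only.
systems = [
    ("large multi-vector model", 700.0, 0.62),
    ("small vision model", 40.0, 0.55),
    ("small vision model + refinement", 60.0, 0.62),
    ("text retriever", 50.0, 0.50),
]
frontier = pareto_frontier(systems)
```

In this toy setup the refined small model displaces the large model from the frontier because it ties on quality at lower latency, which is the pattern the bullet describes.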

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The test-time guidance idea could extend to other retrieval settings where one modality or model type offers cheap signals to refine a stronger but heavier primary system.
By improving smaller models dynamically, the method may reduce pressure to scale representations indefinitely for new tasks.
GQR-style refinement might combine with existing efficiency techniques such as quantization or pruning to further ease real-world deployment.

Load-bearing premise

Scores from the lightweight dense text retriever supply reliable, non-conflicting guidance that refines the primary query embedding effectively without per-query hyperparameter search or degradation in edge cases.

What would settle it

A visual document retrieval benchmark where applying GQR either lowers accuracy relative to the base vision-centric model or requires query-specific hyperparameter adjustments to produce gains.
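One way to run that check is a paired, per-query comparison of the base retriever against its refined counterpart under a single fixed hyperparameter setting. The sketch below assumes per-query metric values (for example NDCG@5) are already computed; the function name, the tolerance, and the example scores are hypothetical.

```python
def summarize_paired_scores(base, refined, tolerance=1e-9):
    """Compare per-query metric values before and after refinement."""
    if not base or len(base) != len(refined):
        raise ValueError("need one score per query from each system")
    deltas = [r - b for b, r in zip(base, refined)]
    return {
        "mean_delta": sum(deltas) / len(deltas),
        "improved": sum(d > tolerance for d in deltas),
        "unchanged": sum(abs(d) <= tolerance for d in deltas),
        "degraded": sum(d < -tolerance for d in deltas),
    }


# Hypothetical per-query scores for four queries.
base_ndcg = [0.60, 0.45, 1.00, 0.30]
refined_ndcg = [0.70, 0.45, 0.90, 0.50]
report = summarize_paired_scores(base_ndcg, refined_ndcg)
```

A benchmark-level average can hide the `degraded` count, so reporting both the mean change and the number of queries that got worse speaks more directly to the premise above.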

Figures

Figures reproduced from arXiv: 2510.05038 by Ariel Gera, Asaf Yehudai, Eyal Shnarch, Omri Uzan, Roi Pony.

**Figure 1.** Hybrid retrieval methods. Aggregating the outputs of two retrievers is typically done at the level of ranks (§2.1) or scores (§2.2). Utilizing the information of both representations effectively and efficiently is difficult to achieve in practice. Here we propose a novel approach of Guided Query Refinement (GQR), using similarity scores from a complementary retriever (left) at test time, to inform the que…

**Figure 2.** Guided Query Refinement (GQR). Stage 1: Two retrievers independently encode the query and retrieve top-K documents, forming a candidate pool. Stage 2: The primary query embedding is iteratively refined (z^(t)) over T iterations by minimizing the KL divergence between a consensus distribution and the primary distribution.

**Figure 3.** Latency–quality tradeoff in online querying. The …

**Figure 5.** Online latency breakdown of GQR for T = 25 and T = 50.

… higher performance. The smallest (10⁻⁵) and largest (5 × 10⁻³) learning rates are suboptimal, where the latter even results in performance degradation relative to the primary retriever. The results capture a tradeoff between latency and stability. Higher learning rates can provide a performance boost faster, but might deteriorate quickly past a certa…

**Figure 6.** Latency–quality tradeoff in online querying. The …

**Figure 7.** Baseline results on ViDoRe 2 across different values of the weight …

**Figure 8.** Query-level dynamics of GQR versus score aggregation. The heat maps depict examples …

**Figure 9.** Storage–quality tradeoff. The x axis is memory in MB, on a log scale, and the y axis is the average evaluation score (NDCG@5). Marker color encodes the primary retriever; marker shape encodes the GQR complementary retriever, with squares indicating the primary retriever alone (without applying GQR).
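Since the figures report NDCG@5, a short sketch of that metric may help in reading them. It uses the linear-gain form of DCG (relevance divided by log2 of rank + 1); whether the paper's evaluation uses linear or exponential gain is not stated in the text here, so that choice and the function names are assumptions.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain: sum of rel_i / log2(i + 1)."""
    return sum(
        rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances, k=5):
    """NDCG@k for one query.

    ranked_relevances: relevance labels in the order the system ranked the
    pages. The ideal ranking is obtained by sorting these same labels, so the
    list should cover every judged page for the query.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# One relevant page ranked second among five candidates.
example = ndcg_at_k([0, 1, 0, 0, 0], k=5)
```

With binary labels and a single relevant page, the score depends only on that page's rank, which is why small re-orderings near the top of the list move NDCG@5 noticeably.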

Original abstract

Multimodal encoders have pushed the boundaries of visual document retrieval, matching textual query tokens directly to image patches and achieving state-of-the-art performance on public benchmarks. Recent models relying on this paradigm have massively scaled the sizes of their query and document representations, presenting obstacles to deployment and scalability in real-world pipelines. Furthermore, purely vision-centric approaches may be constrained by the inherent modality gap still exhibited by modern vision-language models. In this work, we connect these challenges to the paradigm of hybrid retrieval, investigating whether a lightweight dense text retriever can enhance a stronger vision-centric model. Existing hybrid methods, which rely on coarse-grained fusion of ranks or scores, fail to exploit the rich interactions within each model's representation space. To address this, we introduce Guided Query Refinement (GQR), a novel test-time optimization method that refines a primary retriever's query embedding using guidance from a complementary retriever's scores. Through extensive experiments on visual document retrieval benchmarks, we demonstrate that GQR allows vision-centric models to match the performance of models with significantly larger representations, while being up to 14x faster and requiring 54x less memory. Our findings show that GQR effectively pushes the Pareto frontier for performance and efficiency in multimodal retrieval. We release our code at https://github.com/IBM/test-time-hybrid-retrieval
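The paradigm of matching query tokens to image patches is late-interaction scoring in the style of ColBERT's MaxSim, and its storage cost grows with the number of vectors kept per page. A minimal sketch of both, assuming dot-product similarity and uncompressed 16-bit values; the helper names and the example sizes are hypothetical and are not the configurations evaluated in the paper.

```python
def maxsim_score(query_tokens, doc_patches):
    """Late-interaction score: for each query token take the best-matching
    document vector (maximum dot product), then sum over query tokens."""
    return sum(
        max(sum(q * p for q, p in zip(q_vec, p_vec)) for p_vec in doc_patches)
        for q_vec in query_tokens
    )


def index_bytes(num_pages, vectors_per_page, dim, bytes_per_value=2):
    """Storage for page embeddings, ignoring index overhead and compression."""
    return num_pages * vectors_per_page * dim * bytes_per_value


# Two query tokens scored against three patch vectors of one page.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_patches = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
score = maxsim_score(query_tokens, doc_patches)

# Hypothetical sizes: 1,000 pages stored as 1,024 patch vectors of
# dimension 128, versus one 1,024-dimensional vector per page.
multi_vector = index_bytes(1000, vectors_per_page=1024, dim=128)
single_vector = index_bytes(1000, vectors_per_page=1, dim=1024)
ratio = multi_vector / single_vector
```

The storage ratio depends only on vectors per page times dimension, which is why scaling either the patch count or the embedding width of a multi-vector model translates directly into index size.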

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

GQR refines vision query embeddings at test time using a cheap text retriever's scores, which helps smaller models close the gap to larger ones on visual document tasks.

The main point is that this work takes hybrid retrieval beyond rank or score fusion by doing a targeted test-time optimization on the query embedding itself, guided by the complementary model's outputs. That procedural step is the concrete addition over standard multimodal scaling approaches mentioned in the abstract. The experiments on visual document benchmarks show the smaller vision model reaching comparable retrieval quality while cutting memory and inference time substantially, and the code release is a plus for anyone wanting to check the implementation. The method stays grounded in external benchmarks rather than self-referential fitting, which keeps the claims falsifiable.

The main soft spot is the efficiency accounting. The reported 14x speedup and 54x memory savings assume the test-time steps add only negligible cost, but an iterative refinement loop with score evaluations or embedding updates could accumulate per-query latency that eats into those margins, especially if the number of steps varies. The paper positions the approach as practical and largely hyperparameter-free, so missing a clear breakdown of wall-clock time or ablation on optimization depth leaves the Pareto claim partly unverified.

This is useful for retrieval engineers who already run hybrid setups and want to squeeze more out of existing vision encoders without retraining. It has a clear enough method and empirical angle to deserve referee time rather than a desk reject, though the review should focus on runtime measurements and edge-case robustness of the guidance signal.

Referee Report

1 major / 1 minor

Summary. The paper introduces Guided Query Refinement (GQR), a test-time optimization method that refines the query embedding of a primary vision-centric multimodal retriever using score-based guidance from a lightweight dense text retriever. It claims that this hybrid approach enables smaller vision-centric models to match the performance of models with significantly larger representations on visual document retrieval benchmarks, while achieving up to 14x speedup and 54x memory reduction, thereby advancing the performance-efficiency Pareto frontier in multimodal retrieval.

Significance. If the efficiency and performance claims hold after accounting for test-time costs, the work would meaningfully advance hybrid retrieval by moving beyond coarse rank/score fusion to representation-space guidance, offering a practical deployment path for high-performing vision-centric models without requiring larger representations.

major comments (1)

[Abstract] Abstract and the description of the GQR test-time optimization procedure: the reported 14x speedup and 54x memory savings versus larger-representation baselines do not include or bound the per-query cost of the iterative refinement (forward passes, score evaluations, or optimization steps). Without explicit measurement or amortization of this overhead relative to base inference, the net efficiency gains and the claim that GQR matches larger models while remaining faster cannot be verified from the stated results.

minor comments (1)

[Abstract] The abstract refers to 'extensive experiments on visual document retrieval benchmarks' but does not name the specific datasets, baselines, or statistical tests used to support the performance-matching claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on the efficiency claims. We address the major comment regarding the inclusion of test-time optimization costs below.

Referee: [Abstract] Abstract and the description of the GQR test-time optimization procedure: the reported 14x speedup and 54x memory savings versus larger-representation baselines do not include or bound the per-query cost of the iterative refinement (forward passes, score evaluations, or optimization steps). Without explicit measurement or amortization of this overhead relative to base inference, the net efficiency gains and the claim that GQR matches larger models while remaining faster cannot be verified from the stated results.

Authors: We acknowledge that the speedup and memory figures reported in the abstract compare the base inference costs of the smaller vision-centric model (augmented by GQR) to those of larger-representation baselines, without an explicit accounting of the per-query overhead from the iterative refinement procedure. The manuscript focuses on the final retrieval latency after refinement but does not provide per-step timing or bounds on the number of optimization iterations. We agree this omission limits verification of net gains. In the revised manuscript we will add (i) measured wall-clock time per refinement step on the evaluation hardware, (ii) the average and maximum number of steps observed across queries, and (iii) a combined latency figure that includes the full GQR procedure. We will also discuss amortization when GQR is applied to batches or when the number of steps remains small relative to the representation-size savings. These additions will allow readers to assess whether the claimed efficiency advantages hold after test-time costs.

revision: yes
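The accounting requested in this exchange can be sketched as a small timing harness plus a sum. It is an illustration only: the decomposition into query encoding, first-stage search, and T refinement steps is an assumption about how the cost splits, and the function names and timings are hypothetical rather than measurements from the paper.

```python
import time


def time_call(fn, repeats=5):
    """Median wall-clock seconds of fn() over a few repeats."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]


def total_query_latency(encode_s, search_s, step_s, steps):
    """End-to-end per-query latency including all refinement steps.

    encode_s: query encoding time for both retrievers combined
    search_s: first-stage retrieval time for both retrievers combined
    step_s:   cost of one refinement step over the candidate pool
    steps:    number of refinement iterations (T)
    """
    return encode_s + search_s + steps * step_s


# Hypothetical timings in seconds; step_s would come from time_call on
# a single refinement step run on the evaluation hardware.
latency_t25 = total_query_latency(encode_s=0.030, search_s=0.010, step_s=0.002, steps=25)
latency_t50 = total_query_latency(encode_s=0.030, search_s=0.010, step_s=0.002, steps=50)
```

Comparing `latency_t25` and `latency_t50` against a larger model's measured per-query latency is the kind of combined figure that item (iii) of the response describes.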

Circularity Check

0 steps flagged

No significant circularity; method defined procedurally and validated on external benchmarks

The paper introduces Guided Query Refinement (GQR) as a test-time optimization procedure that refines a primary vision-centric model's query embedding using guidance from scores of a complementary lightweight dense text retriever. This is presented as a novel hybrid retrieval technique to address modality gaps and scalability issues. Performance claims (matching larger models while being faster and more memory-efficient) and efficiency assertions are supported exclusively by empirical results on visual document retrieval benchmarks, with no equations, derivations, or fitted parameters shown that reduce the reported gains to quantities defined solely by the method's own inputs or self-referential normalizations. No load-bearing self-citations or uniqueness theorems from overlapping authors are invoked in the provided text to justify the core approach. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard assumptions about retriever complementarity and the effectiveness of gradient-based test-time updates; no new physical entities are postulated and free parameters are limited to typical optimization hyperparameters whose exact values are not detailed in the abstract.

free parameters (1)

test-time optimization hyperparameters
Steps, learning rate, or stopping criteria for the refinement optimization are required to run GQR but are not quantified in the abstract.

axioms (1)

domain assumption: Scores from the complementary text retriever provide useful guidance for query refinement
Invoked when describing how GQR exploits interactions within representation spaces to improve the primary vision-centric model.

pith-pipeline@v0.9.0 · 5778 in / 1315 out tokens · 59644 ms · 2026-05-18T09:38:32.120158+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
cs.LG · 2026-05 · unverdicted · novelty 7.0

Agentic program search over frozen embedding APIs yields a parameter-free inference algebra—a softmax-weighted centroid of top-K documents interpolated with the query—that lifts nDCG@10 across seven model families on ...

Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
cs.LG · 2026-05 · unverdicted · novelty 7.0

A softmax-weighted centroid of the local top-K documents interpolated with the query improves nDCG@10 for frozen embedding models across seven families on held-out BEIR data.

Task-Adaptive Embedding Refinement via Test-time LLM Guidance
cs.CL · 2026-05 · unverdicted · novelty 6.0

Test-time LLM feedback refines query embeddings to deliver up to 25% relative gains on zero-shot literature search, intent detection, and related benchmarks.

Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings
cs.CV · 2026-04 · unverdicted · novelty 6.0

Rewrite-driven generation with alignment and RL produces shorter, more effective generative multimodal embeddings than CoT methods on retrieval benchmarks.

[1] [3]

PaliGemma: A versatile 3B VLM for transfer

URL https://arxiv.org/abs/2407.07726. Sebastian Bruch, Siyu Gai, and Amir Ingber. An analysis of fusion functions for hybrid retrieval. ACM Transactions on Information Systems, 42(1), August

work page internal anchor Pith review Pith/arXiv arXiv

[2] [4]

Antoine Chaffin and Aur´elien Lac

URLhttps://doi.org/ 10.1145/3596512. Antoine Chaffin and Aur´elien Lac. Monoqwen: Visual document reranking,

work page doi:10.1145/3596512

[3] [5]

Tao Chen, Mingyang Zhang, Jing Lu, Michael Bendersky, and Marc Najork

URLhttps: //huggingface.co/lightonai/MonoQwen2-VL-v0.1. Tao Chen, Mingyang Zhang, Jing Lu, Michael Bendersky, and Marc Najork. Out-of-domain se- mantics to the rescue! Zero-shot hybrid retrieval models. InAdvances in Information Retrieval: 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part I, pp. 95...

work page 2022

[4] [6]

Springer-Verlag. ISBN 978-3-030-99735-9. URL https://doi.org/10.1007/978-3-030-99736-6_7. Chanyeol Choi, Junseong Kim, Seolhwa Lee, Jihoon Kwon, Sangmo Gu, Yejin Kim, Minkyung Cho, and Jy-yong Sohn. Linq-embed-mistral technical report. arXiv:2412.03223,

[5] [7]

URL https://arxiv.org/abs/2412.03223. Benjamin Clavié and Florian Brand. Readbench: Measuring the dense text visual reading ability of vision-language models,

[6] [8]

URL https://arxiv.org/abs/2505.19091. Gordon V. Cormack, Charles L A Clarke, and Stefan Buettcher. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In SIGIR ’09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pp. 758–759, New York, NY, USA,

[7] [9]

URL http://doi.acm.org/10.1145/1571941.1572114. Cícero dos Santos, Luciano Barbosa, Dasha Bogdanova, and Bianca Zadrozny. Learning hybrid representations to retrieve semantically equivalent questions. In Chengqing Zong and Michael Strube (eds.), Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th Internati...

[8] [10]

Association for Computing Machinery. ISBN 9798400715921. URL https://doi.org/10.1145/3726302.3730160. Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, et al. jina-embeddings-v4: Universal embeddings for multimodal multilingual retrieval. arXiv:2506.18902,

[9] [11]

URL https://arxiv.org/abs/2506.18902. Hsin-Ling Hsu and Jengnan Tzeng. DAT: Dynamic alpha tuning for hybrid retrieval in retrieval-augmented generation. arXiv:2503.23013,

[10] [13]

Unsupervised Dense Information Retrieval with Contrastive Learning

URL https://arxiv.org/abs/2112.09118. Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of ICML, pp. 4904–4916. PMLR,

[11] [14]

URL https://proceedings.mlr.press/v139/jia21b.html. Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Lan...

[12] [15]

Association for Computational Linguistics. URL https://aclanthology.org/2020.emnlp-main.550/. Omar Khattab and Matei Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pp. 39–48,

[13] [17]

URL https://arxiv.org/abs/2010.01195. Binxu Li, Yuhui Zhang, Xiaohan Wang, Weixin Liang, Ludwig Schmidt, and Serena Yeung-Levy. Closing the modality gap for mixed modality search,

[14] [18]

URL https://arxiv.org/abs/2507.19054. Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In Advances in Neural Information Processing Systems, NeurIPS,

[15] [19]

URL https://proceedings.neurips.cc/paper/2021/hash/505259756244493872b7709a8a01b536-Abstract.html. Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pp. 12888–12900. PMLR,

[16] [20]

Association for Computational Linguistics. URL https://aclanthology.org/2024.acl-long.775. Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. Sparse, dense, and attentional representations for text retrieval. Transactions of the Association for Computational Linguistics, 9:329–345,

[17] [22]

ViDoRe benchmark v2: Raising the bar for visual retrieval. arXiv preprint arXiv:2505.17166, 2025

URL https://arxiv.org/abs/2505.17166. Minesh Mathew, Viraj Bagal, Rubèn Pérez Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V. Jawahar. InfographicVQA, 2021a. URL https://arxiv.org/abs/2104.12756. Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. DocVQA: A dataset for VQA on document images, 2021b. URL https://arxiv.org/abs/2007.00398. Rodrig...

[18] [24]

Representation Learning with Contrastive Predictive Coding

URL https://arxiv.org/abs/1807.03748. Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2010.08191,

[19] [25]

URL https://arxiv.org/abs/2010.08191. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machin...

[20] [26]

URL https://proceedings.mlr.press/v139/radford21a.html. Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natur...

[21] [27]

URL https://arxiv.org/abs/2505.03703. Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information processing & management, 24(5):513–523,

[22] [29]

ColBERTv2: Effective and efficient retrieval via lightweight late interaction

URL https://arxiv.org/abs/2112.01488. Keshav Santhanam, Omar Khattab, Christopher Potts, and Matei Zaharia. PLAID: An efficient engine for late interaction retrieval. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 1747–1756,

[23] [30]

Mujeen Sung, Jungsoo Park, Jaewoo Kang, Danqi Chen, and Jinhyuk Lee. Optimizing test-time query representations for dense retrieval. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, pp. 5731–5746, Toronto, Canada, July

[24] [31]

Association for Computational Linguistics. URL https://aclanthology.org/2023.findings-acl.354/. Ryota Tanaka, Kyosuke Nishida, and Sen Yoshida. VisualMRC: Machine reading comprehension on document images. In AAAI,

[25] [32]

URL https://arxiv.org/abs/2408.09869. IBM Research Team. Granite-vision-3.3-2b-embedding, 2025a. URL https://huggingface.co/ibm-granite/granite-vision-3.3-2b-embedding. Nomic Team. Nomic embed multimodal: Interleaved text, image, and screenshots for visual document retrieval, 2025b. URL https://nomic.ai/blog/posts/nomic-embed-multimodal. Peng Wang, Shu...

[26] [33]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

URL https://arxiv.org/abs/2409.12191. Navve Wasserman, Oliver Heinimann, Yuval Golbari, Tal Zimbalist, Eli Schwartz, and Michal Irani. DocReRank: Single-page hard negative query generation for training multi-modal RAG rerankers. arXiv preprint arXiv:2505.22584,

[27] [34]

URL https://arxiv.org/abs/2505.22584. Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808,

[28] [35]

URL https://arxiv.org/abs/2007.00808. Mengyao Xu, Gabriel Moreira, Ronay Ak, Radek Osmulski, Yauhen Babakhin, Zhiding Yu, Benedikt Schifferer, and Even Oldridge. Llama nemoretriever colembed: Top-performing text-image retrieval model. arXiv:2507.05513,

[29] [37]

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

URL https://arxiv.org/abs/2506.05176. Fengbin Zhu, Wenqiang Lei, Fuli Feng, Chao Wang, Haozhou Zhang, and Tat-Seng Chua. Towards complex document understanding by discrete reasoning. In Proceedings of the 30th ACM International Conference on Multimedia, pp. 4857–4866,

[30] [38]

A BACKGROUND

Neural Information Retrieval represents a fundamental shift from traditional lexical matching methods like BM25 (Robertson et al., 1995), TF-IDF (Salton & Buckley, 1988), and other term-based approaches (Zhai & Lafferty, 2017). Unlike these sparse retrieval methods that rely on exact term matches and statistical properties, neural approache...

[31] [39]

pioneered this approach with its MaxSim operation, which computes the maximum similarity between each query token and all passage tokens, then aggregates these scores. The MaxSim equation, as defined in ColBERT (see eq. 4), finds for each query token embedding q_i the maximum similarity (dot product) with any passage token embedding p_j. These maximum scores ...
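The MaxSim operation described above can be illustrated with a minimal sketch in plain Python. This is not the authors' or ColBERT's implementation: the function names `dot` and `maxsim_score` and the toy 2-dimensional embeddings are assumptions made for illustration, and the per-token maxima are aggregated by summation, as in ColBERT's definition.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens: Sequence[Vector],
                 passage_tokens: Sequence[Vector]) -> float:
    """Late-interaction score between one query and one passage.

    For each query token embedding q_i, take the maximum dot product over
    all passage token embeddings p_j, then sum these per-token maxima.
    """
    if not passage_tokens:
        raise ValueError("passage must contain at least one token embedding")
    return sum(max(dot(q, p) for p in passage_tokens) for q in query_tokens)


# Toy data (hypothetical): two query tokens, three passage tokens, dimension 2.
query = [[1.0, 0.0], [0.0, 1.0]]
passage = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.25]]
score = maxsim_score(query, passage)
```

In this toy case the first query token is best matched by the second passage token and the second query token by the first, so each query token contributes its own best match to the final score. Real systems compute the same quantity with batched matrix operations over many tokens or image patches.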


[32] [40]

for text retrieval and Chaffin & Lac (2024); Wasserman et al. (2025) for visual document retrieval.

A.1 VISUAL DOCUMENT RETRIEVAL

VERBALIZATION-BASED METHODS were the dominant approach before the advent of end-to-end vision models. These pipelines convert visual documents into text through various techniques: traditional Optical Character Recognition (OCR)...

[33] [41]


extract printed text, while Vision-Language Models (VLMs) can generate textual descriptions of visual elements such as charts, diagrams, and infographics. After verbalization, these methods apply standard text retrieval techniques to the extracted content. While verbalization-based approaches can leverage powerful text-only retrieval models, they inherent...


[34] [42]


and adapting the late-interaction framework to vision-language models by treating image patches as visual tokens that interact with textual query tokens through MaxSim operations. ColPali provides native text-query support due to its VLM-based design. Queries remain text, are encoded by the model’s language tower, and are matched directly against visual p...

