miniReranker: Efficient Multimodal Reranking through Visual Cache Reuse and Interaction Sparsity

Anhao Zhao; Junlong Tong; Kai Zou; Ping Nie; Wei Zhang; Xiaoyu Shen; Xuan Lu; Yingqi Fan; Yunpu Ma

arxiv: 2606.10759 · v2 · pith:PSKJQJ47new · submitted 2026-06-09 · 💻 cs.IR

Yingqi Fan, Xuan Lu, Anhao Zhao, Junlong Tong, Ping Nie, Kai Zou, Yunpu Ma, Wei Zhang, Xiaoyu Shen

Pith reviewed 2026-06-27 11:34 UTC · model grok-4.3

classification 💻 cs.IR

keywords multimodal reranking · vision-first prompting · cache reuse · interaction sparsity · MLLM efficiency · early exit · token pruning

The pith

A vision-first prompting format plus three sparsity interventions lets multimodal rerankers run at under 1 percent of dense-model runtime while keeping over 96 percent of original relevance accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that point-wise reranking with multimodal LLMs wastes computation because standard query-first or document-first formats prevent cache reuse across pairs. Switching to a vision-first order aligns the prompt with both VQA-style inputs and the causal mask, so visual tokens can be cached once per query. Three additional controls—early exit from model layers, a narrow cross-segment attention band, and embedder-guided visual-token pruning—further cut active computation. Together these changes produce the miniReranker design that delivers the stated efficiency and accuracy numbers.
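The cache-reuse argument rests on a property of causal attention: the states of a prefix never depend on what follows it. The sketch below illustrates that property with a toy two-layer model in pure Python. The model, the `forward` and `attend` helpers, and the embedding values are all hypothetical and are not taken from the paper; the point is only that scoring a candidate against cached prefix states gives the same result as recomputing the full sequence.

```python
import math

def attend(query, keys):
    """Softmax attention of one query over key vectors (values equal keys here)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * key[d] for w, key in zip(weights, keys)) / total
            for d in range(len(query))]

def forward(tokens, prefix_cache=None, num_layers=2):
    """Toy causal model. Returns (per-layer inputs of `tokens`, final hidden states)."""
    cache = prefix_cache if prefix_cache is not None else [[] for _ in range(num_layers)]
    hidden = [list(t) for t in tokens]
    layer_inputs = []
    for layer in range(num_layers):
        layer_inputs.append(hidden)          # these states act as keys/values
        context = cache[layer] + hidden      # cached prefix states come first
        offset = len(cache[layer])
        hidden = [
            [h + a for h, a in zip(vec, attend(vec, context[: offset + i + 1]))]
            for i, vec in enumerate(hidden)  # causal: position i sees only earlier ones
        ]
    return layer_inputs, hidden

# Hypothetical embeddings: one visual prefix, two candidate suffixes.
visual_prefix = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.1]]
candidates = [[[0.1, 0.9]], [[-0.3, 0.2], [0.7, 0.0]]]

visual_cache, _ = forward(visual_prefix)                 # computed once
max_gap = 0.0
for doc in candidates:
    _, reused = forward(doc, prefix_cache=visual_cache)  # only the suffix is processed
    _, full = forward(visual_prefix + doc)               # dense recomputation
    for a, b in zip(reused, full[-len(doc):]):
        max_gap = max(max_gap, max(abs(x - y) for x, y in zip(a, b)))
```

Here `max_gap` measures the largest difference between the cached and the recomputed suffix states. If the prefix were placed after the candidate instead, its states would change with every candidate and nothing could be cached.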

Core claim

The authors claim that a vision-first formulation improves both cache reuse and relevance modeling, and that the combination of early exit, restricted cross-segment attention, and embedder-guided pruning reduces reranking runtime to less than 1 percent of the dense baseline under high-reuse conditions for a single query while retaining more than 96 percent of the dense model's performance.
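As a rough illustration of the early-exit part of this claim, the sketch below runs only the first few layers of a layer stack and reads the score from there. It assumes a fixed exit layer chosen in advance; the summary does not state the paper's actual exit criterion, and the function name, the toy layers, and the identity head are hypothetical.

```python
def score_with_early_exit(layers, head, hidden, exit_layer):
    """Apply only the first `exit_layer` layers, then read the score from `head`."""
    if not 1 <= exit_layer <= len(layers):
        raise ValueError("exit_layer must be between 1 and len(layers)")
    for layer in layers[:exit_layer]:
        hidden = layer(hidden)
    return head(hidden)

# Hypothetical 4-layer stack: each "layer" adds a constant; the head is the identity.
toy_layers = [lambda h, step=s: h + step for s in (1.0, 0.5, 0.25, 0.125)]
full_score = score_with_early_exit(toy_layers, lambda h: h, 0.0, exit_layer=4)
early_score = score_with_early_exit(toy_layers, lambda h: h, 0.0, exit_layer=2)
```

The skipped layers cost nothing at inference time, which is why depth reduction composes with the cache reuse described above.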

What carries the argument

The vision-first prompt formulation together with early-exit, narrow interaction-band attention, and embedder-guided visual-token pruning.
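One plausible reading of the narrow interaction band is an attention mask that lets suffix tokens attend to the cached prefix only in a few chosen layers, and keeps them within their own segment elsewhere. The sketch below builds such a mask. This interpretation, the function names, and the band `{10, 11, 12, 13}` are assumptions made for illustration, not the paper's specification.

```python
def build_attention_mask(n_prefix, n_suffix, layer, band_layers):
    """Boolean mask[i][j]: may position i attend to position j at this layer?"""
    n = n_prefix + n_suffix
    cross_allowed = layer in band_layers
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            causal = j <= i
            same_segment = (i < n_prefix) == (j < n_prefix)
            row.append(causal and (same_segment or cross_allowed))
        mask.append(row)
    return mask

def cross_segment_links(mask, n_prefix):
    """Count suffix-to-prefix attention links left open by the mask."""
    return sum(sum(row[:n_prefix]) for row in mask[n_prefix:])

band = {10, 11, 12, 13}  # hypothetical band of interaction layers
outside = build_attention_mask(3, 2, layer=5, band_layers=band)
inside = build_attention_mask(3, 2, layer=11, band_layers=band)
```

With 3 prefix and 2 suffix tokens, `cross_segment_links(outside, 3)` is 0 and `cross_segment_links(inside, 3)` is 6, while causal attention within each segment is unchanged in both cases.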

If this is right

Reranking latency becomes low enough for real-time use inside large-scale multimodal search pipelines.
Visual tokens can be cached once per query and reused across many candidate documents.
Model depth and attention cost scale independently of the number of documents being reranked.
The same sparsity pattern can be applied to other point-wise MLLM scoring tasks.
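The benefit of caching visual tokens once and reusing them can be made concrete with a simple amortization formula: the one-time prefill cost is spread over every candidate scored against it. The sketch below uses made-up unit costs purely to show the trend; they are not measurements from the paper.

```python
def amortized_cost_per_candidate(prefill_cost, online_cost, num_candidates):
    """Average cost per scored pair when one cached prefix serves every candidate."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    return (prefill_cost + num_candidates * online_cost) / num_candidates

# Hypothetical unit costs, chosen only to show the trend.
PREFILL, ONLINE = 100.0, 1.0
costs = {n: amortized_cost_per_candidate(PREFILL, ONLINE, n)
         for n in (1, 10, 100, 1000)}
```

As the number of candidates grows, the per-candidate cost approaches the online cost alone, which is the regime the summary calls high reuse.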

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same cache-reuse pattern may extend to any task where one modality is fixed across many comparisons.
Pruning guided by a lightweight embedder could be tested on non-visual modalities if an analogous cheap signal exists.
Early-exit thresholds might be learned per layer rather than fixed, potentially recovering more accuracy at the same speed.
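Pruning guided by a cheap signal can be sketched as a top-k selection over per-token importance scores. The code below assumes the scores already exist, for example from a lightweight embedder's attention; how the paper actually derives them is not described in this summary, and `prune_visual_tokens` and the sample values are hypothetical.

```python
def prune_visual_tokens(tokens, importance, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving their original order."""
    if len(tokens) != len(importance):
        raise ValueError("tokens and importance must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    keep = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    kept_positions = sorted(ranked[:keep])  # restore sequence order
    return [tokens[i] for i in kept_positions]

# Hypothetical tokens and importance scores.
tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
importance = [0.05, 0.40, 0.10, 0.30, 0.02, 0.13]
kept = prune_visual_tokens(tokens, importance, keep_ratio=0.5)
```

Keeping the survivors in their original order matters for models that rely on positional information, so the selection sorts by score first and by position second.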

Load-bearing premise

The three sparsity interventions do not materially reduce the MLLM's ability to judge query-document relevance.

What would settle it

Measure NDCG or recall on a held-out multimodal retrieval set after applying all three sparsity methods; if accuracy falls below 96 percent of the dense baseline, the claim that the speedup comes without material accuracy loss no longer holds.
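The check described above reduces to computing a ranking metric for both models and comparing their ratio to the 96 percent threshold. The sketch below does this for a single query using NDCG with linear gain (a common variant; the paper's exact metric definition is not given here). The relevance labels are invented for illustration, and a real evaluation would average over many queries.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain with linear gain over the top-k results."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg(ranked_relevances, k):
    """DCG normalized by the DCG of the ideal ordering of the same labels."""
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0

def retention(sparse_metric, dense_metric):
    """Fraction of the dense model's metric kept by the sparse model."""
    return sparse_metric / dense_metric

# Hypothetical relevance labels in the order each model ranked the documents.
dense_ranking = [3, 2, 1, 0]
sparse_ranking = [2, 3, 1, 0]

ratio = retention(ndcg(sparse_ranking, k=4), ndcg(dense_ranking, k=4))
meets_threshold = ratio >= 0.96
```

In this invented example a single swap at the top of the list is enough to push retention below the threshold, which shows how sensitive the criterion is on short candidate lists.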

Figures

Figures reproduced from arXiv: 2606.10759 by Anhao Zhao, Junlong Tong, Kai Zou, Ping Nie, Wei Zhang, Xiaoyu Shen, Xuan Lu, Yingqi Fan, Yunpu Ma.

**Figure 1.** Overview of miniReranker. Left: the proposed Vision-first reformulation enables reusable visual precaching for both vision-as-document and vision-as-query settings. Right: miniReranker further improves efficiency through three complementary compression strategies: (1) Early Exit, which reduces depth-wise computation by terminating inference at intermediate layers; (2) Interaction Band, which restricts cro…

**Figure 2.** Layer-wise Logit Probing reveals substantial depth-wise redundancy in multimodal reranking, while Cross-segment Interaction Analysis shows that effective cross-segment information exchange is concentrated within a narrow range of intermediate layers. … first strategy, the reduction in online FLOPs is $\Delta C_{V \to T} = C_{\text{d-first}} - C_{\text{q-first}} = O$ …

**Figure 3.** Training throughput comparison. Training Hours. Our compression framework mitigates multimodal reranker training bottlenecks via (1) early exit, which reduces the number of updated parameters; and (2) visual token pruning, which shortens the long multimodal sequences. These optimizations jointly reduce both forward and backward computation costs; miniReranker achieves nearly 3× training acceleration compared with the …

**Figure 5.** Latency scaling in the vision-as-document setting. Reranking Latency: Ablation. We further analyze the contribution of each compression component to reranking acceleration. We scale the number of candidates and report the latency averaged over the two reuse scenarios and tasks. As shown in …

**Figure 4.** Latency scaling in the vision-as-query setting. Reranking Latency: Vision as Document. For the vision-as-document setting, we evaluate on MS COCO t2i and the video retrieval benchmark MSR-VTT. As shown in …

**Figure 7.** Layer-wise probing on general VQA tasks. We evaluate prefill-only yes/no tasks and multiple-choice tasks using intermediate-layer logits. Unlike point-wise reranking, general VQA tasks only recover final-layer performance at much deeper layers, typically around layer 22 or later. Open-ended Tasks. We also evaluate open-ended VQA tasks, where the model needs to generate free-form answers. Since full layer-…

**Figure 8.** Qwen3-VL-2B-reranker_DF†. Plot of performance (%) against layer (1 to 28) for Video, VisDoc, and Image.

**Figure 9.** Qwen3-VL-2B-reranker_VF†. Plot of performance (%) against layer (1 to 28) for Video, VisDoc, and Image.

**Figure 10.** Qwen3-VL-2B-reranker_QF†. As shown in …

**Figure 11.** Prompt template for vision-as-query reranking, where the query-side visual input is placed before the candidate document to enable reuse across candidates. Vision-as-document. For tasks where the visual input belongs to the document, e.g., text-to-image or image-to-image retrieval, we instead place the document before the query. This makes the document-side visual representations independent of the inc…

**Figure 12.** Prompt template for vision-as-document reranking, where the document-side visual input is placed before the query to enable reuse across queries. E Ablation on Visual Token Selection. To further validate the effectiveness of our embedder-attention-guided token selection strategy, we compare it with several alternative visual token selection methods. For fair comparison, all methods prune visual tokens be…

**Figure 13.** End-to-end latency including visual pre-encoding and cache construction overhead in the vision-as-query setting, measured under different numbers of candidate documents. Vision-as-document setting. In this setting, document-side visual representations are cached once and reused across many incoming queries. Since the reuse frequency increases with the number of queries, we report latency scaling with re…

**Figure 14.** End-to-end latency including visual pre…

Original abstract

Multimodal large language models (MLLMs) have recently shown strong potential as point-wise rerankers by directly modeling query--document relevance through next-token prediction. However, point-wise reranking suffers from substantial repeated computation across query--document pairs, while the causal structure of transformers allows only prefix segments to be reused via pre-caching. To address the misalignment of existing query-first and document-first formats with both VQA-style prompting and computation-aware reuse, we propose a $\textit{vision-first}$ formulation that improves both cache reuse efficiency and reranking performance. However, the remaining cost is still considerable and stems from three main sources: (1) $\textit{model depth}$, for which we reduce active parameters via early exit; (2) $\textit{cross-segment attention}$, which we restrict to a narrow interaction band across a few layers; and (3) $\textit{visual tokens}$, where we reduce the number of tokens via embedder-guided pruning. Together, these designs form miniReranker, which reduces reranking runtime to <1% of the dense implementation under high-reuse settings for a single query, while preserving >96% of the dense model performance.
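Point-wise reranking through next-token prediction is commonly implemented by comparing the logits of two answer tokens and ranking candidates by the resulting probability. The sketch below assumes a "yes"/"no" answer pair and a softmax restricted to those two tokens; the abstract does not specify the exact scoring rule, and the function names and logit values are hypothetical.

```python
import math

def relevance_score(yes_logit, no_logit):
    """P("yes") under a softmax restricted to the two answer tokens."""
    return 1.0 / (1.0 + math.exp(no_logit - yes_logit))

def rerank(candidates):
    """candidates: (doc_id, yes_logit, no_logit) triples; returns ids by descending score."""
    scored = [(relevance_score(y, n), doc_id) for doc_id, y, n in candidates]
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: p[0], reverse=True)]

# Hypothetical logits for three candidates of one query.
order = rerank([("a", 1.0, 2.0), ("b", 3.0, 0.5), ("c", 0.0, 0.0)])
```

Because each candidate is scored in its own forward pass, every pair repeats the work for the shared segment unless that segment sits in a cacheable prefix, which is the inefficiency the abstract targets.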

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proposes efficiency optimizations for MLLM point-wise rerankers via vision-first prompting and sparsity, but the abstract supplies no datasets, baselines, or controls against which to evaluate the claims.

The core idea is a set of practical changes to make MLLM-based reranking cheaper: switch to a vision-first prompt order for better prefix caching, then apply early exit on depth, restrict cross-segment attention to a narrow band in a few layers, and prune visual tokens using an embedder. These are combined into miniReranker.

What stands out is the explicit focus on cache reuse misalignment in existing query-first or document-first formats and the three concrete sparsity levers aimed at the remaining cost. That combination is not just a routine extension of prior cache or sparsity work.

The soft spot is obvious from the abstract alone: it states runtime drops to under 1% of dense and retains over 96% performance, yet gives zero information on the models, datasets, baselines, number of queries, or any statistical checks. Without those, the central efficiency and accuracy claims cannot be assessed. The weakest assumption—that the vision-first change plus the three interventions preserve relevance judgment—remains untested in the provided text.

This is for people building production multimodal retrieval systems who already have MLLM rerankers and need lower latency under repeated queries. A reader already working on cache-aware inference or token pruning might find the specific integration useful, but only after seeing the actual experiments.

I would not send it to peer review in its current form because the soundness gap is load-bearing; the paper needs the methods, data, and ablations filled in before a referee can do useful work.

Referee Report

2 major / 0 minor

Summary. The paper proposes miniReranker for efficient point-wise multimodal reranking with MLLMs. It introduces a vision-first prompting formulation to improve KV-cache reuse over query-first or document-first formats, then applies three sparsity interventions—early exit to reduce active model depth, a narrow interaction band to limit cross-segment attention in selected layers, and embedder-guided pruning to reduce visual tokens. The central claim is that these changes together reduce reranking runtime to <1% of a dense baseline under high-reuse single-query settings while retaining >96% of the dense model's relevance performance.

Significance. If the reported efficiency and accuracy numbers are reproducible, the work would address a practical bottleneck in deploying MLLM rerankers at scale by exploiting cache reuse and structured sparsity rather than model compression or distillation. The vision-first reformulation and the three targeted sparsity mechanisms are concrete engineering contributions that could be adopted in production retrieval pipelines.

major comments (2)

[Abstract] The manuscript states concrete runtime (<1% of dense) and accuracy (>96% retention) figures yet supplies no experimental section, datasets, baselines, number of queries/documents, hardware, or statistical details; without these the central claim cannot be evaluated.
[—] No equations, derivations, or complexity analysis appear in the provided text; the efficiency claims rest entirely on unreported empirical measurements, leaving open whether the reported gains are parameter-free or depend on specific hyper-parameter choices for the interaction band and pruning thresholds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to ensure all claims are fully supported and evaluable.

Referee: [Abstract] The manuscript states concrete runtime (<1% of dense) and accuracy (>96% retention) figures yet supplies no experimental section, datasets, baselines, number of queries/documents, hardware, or statistical details; without these the central claim cannot be evaluated.

Authors: We agree that the provided manuscript text (limited to the abstract) does not include an experimental section or supporting details. The full submission will be revised to incorporate a dedicated Experiments section reporting the specific datasets, baselines, query/document counts, hardware platform, and statistical measures (including variance across runs) that underpin the <1% runtime and >96% retention figures.
revision: yes

Referee: [—] No equations, derivations, or complexity analysis appear in the provided text; the efficiency claims rest entirely on unreported empirical measurements, leaving open whether the reported gains are parameter-free or depend on specific hyper-parameter choices for the interaction band and pruning thresholds.

Authors: We acknowledge the absence of equations and complexity analysis in the current text. We will add a dedicated Analysis section that formally defines the vision-first prompting, early-exit criterion, narrow interaction band, and embedder-guided pruning, derives the resulting complexity reductions, and explicitly states the hyper-parameter values chosen for the interaction band width and pruning thresholds. We will also include a sensitivity study showing how runtime and relevance vary with these choices.
revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical engineering design paper focused on practical optimizations (vision-first formulation, early exit, narrow interaction band, embedder-guided pruning) for multimodal reranking efficiency. No equations, derivations, or mathematical claims are present in the provided abstract or description. There are no load-bearing steps that reduce predictions to inputs by construction, no fitted parameters presented as independent predictions, and no self-citation chains invoked to justify uniqueness theorems or ansatzes. The central claims rest on design choices and reported empirical performance metrics rather than any self-referential logic, making the work self-contained against external benchmarks with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, hyperparameters, or modeling assumptions; no free parameters, axioms, or invented entities can be identified.


[1] [1]

2023 , eprint=

Attention Is All You Need , author=. 2023 , eprint=

2023

[2] [2]

Zhanpeng Chen and Chengjin Xu and Yiyan Qi and Jian Guo , year=

[3] [3]

2020 , eprint=

Passage Re-ranking with BERT , author=. 2020 , eprint=

2020

[4] [4]

2023 , eprint=

UniIR: Training and Benchmarking Universal Multimodal Information Retrievers , author=. 2023 , eprint=

2023

[5] [5]

2021 , eprint=

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=. 2021 , eprint=

2021

[6] [6]

2019 , eprint=

Multi-Stage Document Ranking with BERT , author=. 2019 , eprint=

2019

[7] [7]

Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

Liu, Nelson F. and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00638

work page doi:10.1162/tacl_a_00638 2024

[8] [8]

2024 , eprint=

Efficient Streaming Language Models with Attention Sinks , author=. 2024 , eprint=

2024

[9] [9]

arXiv preprint arXiv:2511.21631 , year=

Qwen3-VL Technical Report , author=. arXiv preprint arXiv:2511.21631 , year=

Pith/arXiv arXiv

[10] [10]

2024 , eprint=

Improved Baselines with Visual Instruction Tuning , author=. 2024 , eprint=

2024

[11] [11]

2024 , eprint=

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models , author=. 2024 , eprint=

2024

[12] [12]

2022 , eprint=

Flamingo: a Visual Language Model for Few-Shot Learning , author=. 2022 , eprint=

2022

[13] [13]

2023 , eprint=

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning , author=. 2023 , eprint=

2023

[14] [14]

2024 , eprint=

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks , author=. 2024 , eprint=

2024

[15] [15]

2024 , eprint=

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , author=. 2024 , eprint=

2024

[16] Wu, Hao and Tong, Junlong and Wang, Xudong and Tan, Yang and Zeng, Changyu and Antsiferova, Anastasia and Shen, Xiaoyu. From Data to Model: A Survey of the Compression Lifecycle in MLLMs. doi:10.36227/techrxiv.177220375.55495124/v1

[17] Learning Transferable Visual Models From Natural Language Supervision. 2021.

[18] Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting. 2024.

[19] Cao, Zhe and Qin, Tao and Liu, Tie-Yan and Tsai, Ming-Feng and Li, Hang. 2007. doi:10.1145/1273496.1273513

[20] Xia, Fen and Liu, Tie-Yan and Wang, Jue and Zhang, Wensheng and Li, Hang. 2008. ISBN 978-1-60558-205-4. doi:10.1145/1390156.1390306

[21] Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking. 2025.

[22] Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking. 2026.

[23] Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. 2025.

[24] Chen, Zhanpeng and Xu, Chengjin and Qi, Yiyan and Jiang, Xuhui and Guo, Jian. VLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.432

[25] Mm-r5: Multimodal reasoning-enhanced reranker via reinforcement learning for document retrieval. arXiv preprint arXiv:2506.12364.

[26] LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2025.

[27] Sheng-Chieh Lin and Chankyu Lee and Mohammad Shoeybi and Jimmy Lin and Bryan Catanzaro and Wei Ping. 2025.

[28] Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval. 2026.

[29] The Evolution of Reranking Models in Information Retrieval: From Heuristic Methods to Large Language Models. 2025.

[30] Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation. 2025.

[31] Yinxin Zhou and Qin Luo and Bin Feng and Bang Wang. TechRxiv. 2025.

[32] Zhang, Xin and Zhang, Yanzhao and Xie, Wen and Li, Mingxin and Dai, Ziqi and Long, Dingkun and Xie, Pengjun and Zhang, Meishan and Li, Wenjie and Zhang, Min. Bridging Modalities: Improving Universal Multimodal Retrieval by Multimodal Large Language Models.

[33] Zhang, Xin and Dai, Ziqi and Li, Yongqi and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan and Yu, Jun and Li, Wenjie and Zhang, Min. Towards Text-Image Interleaved Retrieval. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.214

[34] ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. 2020.

[35] VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks. 2025.

[36] VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents. 2025.

[37] ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios. 2026.

[38] jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval. 2025.

[39] MMBench: Is Your Multi-modal Model an All-around Player? 2024.

[40] Evaluating Object Hallucination in Large Vision-Language Models. 2023.

[41] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. 2025.

[42] Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. 2022.

[43] Are We on the Right Way for Evaluating Large Vision-Language Models? 2024.

[44] GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. 2019.

[45] Towards VQA Models That Can Read. 2019.

[46] Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward. arXiv preprint arXiv:2404.01258.

[47] ColPali: Efficient Document Retrieval with Vision Language Models. 2024.

[48] VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents. 2025.

[49] Mixture-of-Depths: Dynamically allocating compute in transformer-based language models. 2024.

[50] Do Language Models Use Their Depth Efficiently? 2025.

[51] The Curse of Depth in Large Language Models. 2026.

[52] Layer by Layer: Uncovering Hidden Representations in Language Models. 2025.

[53] The Remarkable Robustness of LLMs: Stages of Inference? 2025.

[54] DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. 2024.

[55] ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention. 2026.

[56] Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter. 2026.

[57] Hao Wu and Yingqi Fan and Dai Jinyang and Junlong Tong and Yunpu Ma and Xiaoyu Shen. HiDrop: Hierarchical Vision Token Reduction in. 2026.

[58] Lin, Junyan and Chen, Haoran and Zhu, Dawei and Shen, Xiaoyu. To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.325

[59] What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models. 2026.

[60] Yingqi Fan and Anhao Zhao and Jinlan Fu and Junlong Tong and Hui Su and Yijie Pan and Wei Zhang and Xiaoyu Shen. arXiv:2510.17205.

[61] A Mathematical Framework for Transformer Circuits. 2021.

[62] nostalgebraist. 2020.

[63] LongRanker: Efficient One-Pass Document Reranking with Long-Context Large Language Models. Proceedings of the ACM Web Conference 2026. 2026.

[64] Very Efficient Listwise Multimodal Reranking for Long Documents. 2026.

[65] Scalable In-context Ranking with Generative Models. The Thirty-ninth Annual Conference on Neural Information Processing Systems.

[66] Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers. The Thirteenth International Conference on Learning Representations.

[67] Zhang, Wuwei and Yin, Fangcong and Yen, Howard and Chen, Danqi and Ye, Xi. Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1214

[68] HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse. 2025.

[69] Reranking with Compressed Document Representation. 2025.

[70] Efficient Long-Document Reranking via Block-Level Embeddings and Top-k Interaction Refinement. 2026.

[71] RankLLM: A Python Package for Reranking with LLMs. 2025.

[72] Rank-R1: Enhancing Reasoning in LLM-based Document Rerankers via Reinforcement Learning. 2025.

[73] Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.

[74] A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs. 2024.

[75] PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models. 2025.

[76] Jiang, Pengfei and Li, Hanjun and Zhao, Linglan and Chao, Fei and Yan, Ke and Ding, Shouhong and Ji, Rongrong. VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference. doi:10.1145/3746027.3755792

[77] See What You Are Told: Visual Attention Sink in Large Multimodal Models. 2025.