Very Efficient Listwise Multimodal Reranking for Long Documents
Pith reviewed 2026-05-13 05:12 UTC · model grok-4.3
The pith
ZipRerank achieves state-of-the-art multimodal listwise reranking accuracy on long documents while cutting inference latency by up to an order of magnitude.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ZipRerank reduces input length via lightweight query-image early interaction and replaces autoregressive decoding with single-forward-pass scoring of all candidates. A two-stage training process—listwise pretraining on rendered text images followed by multimodal fine-tuning using VLM-teacher-distilled soft supervision—enables the model to match or exceed prior multimodal rerankers on the MMDocIR benchmark while lowering LLM inference latency by up to ten times.
What carries the argument
Lightweight query-image early interaction combined with single-pass listwise scoring that avoids autoregressive token generation.
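The efficiency argument hinges on replacing a token-by-token decoding loop with one joint forward pass over the query and all candidates. The sketch below illustrates that idea only: the attention computation, random stand-in weights, and dimensions are illustrative assumptions, not ZipRerank's actual architecture or parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def single_pass_scores(query, cands):
    """Toy single-pass listwise scorer: the query and all k candidates
    form one joint sequence, each position attends to every other, and a
    linear head maps each contextualized candidate to a scalar relevance
    score. Weights are random stand-ins; a real model learns them."""
    k, d = cands.shape
    seq = np.vstack([query[None, :], cands])           # (k+1, d): query + candidates
    attn = seq @ seq.T / np.sqrt(d)                    # joint attention logits
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)            # row-wise softmax
    ctx = attn @ seq                                   # contextualized representations
    w = rng.normal(size=d)                             # illustrative score head
    return ctx[1:] @ w                                 # one score per candidate

scores = single_pass_scores(rng.normal(size=16), rng.normal(size=(8, 16)))
ranking = np.argsort(-scores)  # full listwise ordering, no decoding loop
```

The point of the sketch is the shape of the computation: an autoregressive listwise reranker would need up to k generation steps to emit a ranked permutation, while this form produces all k scores in a single pass.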
If this is right
- Listwise reranking can now be applied in real-time multimodal retrieval systems that previously found it too slow.
- The same efficiency pattern extends directly to other long-document vision-language tasks that rely on ranking.
- Two-stage training starting from rendered text allows reuse of large text datasets for multimodal rerankers.
- Single-pass scoring removes the need for multi-step decoding in listwise settings.
Where Pith is reading between the lines
- The method may scale to documents longer than those tested if the early-interaction compression remains effective.
- Similar early-interaction designs could be tested in non-ranking multimodal generation pipelines to reduce token counts.
- If the distilled supervision generalizes, the approach offers a template for distilling efficiency into other vision-language ranking models.
Load-bearing premise
That the early interaction and single-pass scoring, after training that begins with rendered text images, continue to preserve ranking quality on actual long multimodal documents.
What would settle it
Compare ZipRerank against a standard VLM-based listwise reranker on the same MMDocIR queries; if NDCG or similar ranking metrics drop by more than a few points while the claimed latency gains hold, the claim of accuracy parity at an order-of-magnitude lower latency is refuted.
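For concreteness, the metric such a head-to-head comparison would turn on can be computed as follows; this is a minimal NDCG@k implementation using the linear-gain variant, with made-up relevance grades.

```python
import math

def dcg_at_k(rels, k):
    """Discounted cumulative gain over the top-k graded relevances
    (linear gain; some evaluations use 2**rel - 1 instead)."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(ranked_rels, k):
    """NDCG@k: DCG of the produced ranking divided by the DCG of the
    ideal (relevance-sorted) ranking."""
    ideal = dcg_at_k(sorted(ranked_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0

# Relevance grades of documents in the order a reranker returned them:
print(round(ndcg_at_k([3, 2, 0, 1], k=4), 3))  # 0.985
```

A drop "by more than a few points" would mean this number falling by a few hundredths relative to the baseline reranker on the same query set.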
Figures
Original abstract
Listwise reranking is a key yet computationally expensive component in vision-centric retrieval and multimodal retrieval-augmented generation (M-RAG) over long documents. While recent VLM-based rerankers achieve strong accuracy, their practicality is often limited by long visual-token sequences and multi-step autoregressive decoding. We propose ZipRerank, a highly efficient listwise multimodal reranker that directly addresses both bottlenecks. It reduces input length via a lightweight query-image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in a single forward pass. To enable effective learning, ZipRerank adopts a two-stage training strategy: (i) listwise pretraining on large-scale text data rendered as images, and (ii) multimodal finetuning with VLM-teacher-distilled soft-ranking supervision. Extensive experiments on the MMDocIR benchmark show that ZipRerank matches or surpasses state-of-the-art multimodal rerankers while reducing LLM inference latency by up to an order of magnitude, making it well-suited for latency-sensitive real-world systems. The code is available at https://github.com/dukesun99/ZipRerank.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ZipRerank, a listwise multimodal reranker for long documents that shortens visual input via a lightweight query-image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in one forward pass. It uses a two-stage training process consisting of listwise pretraining on large-scale text rendered as images followed by finetuning with soft-ranking labels distilled from a VLM teacher. On the MMDocIR benchmark, ZipRerank is reported to match or exceed state-of-the-art multimodal rerankers while cutting LLM inference latency by up to an order of magnitude.
Significance. If the empirical claims hold under scrutiny, the work is significant for practical vision-centric retrieval and M-RAG pipelines, where listwise reranking over long multimodal documents has been limited by high latency. The open-sourced code at the provided GitHub link strengthens reproducibility and potential adoption.
major comments (2)
- [Abstract and §3] Abstract and §3 (two-stage training): the claim that listwise pretraining on rendered text images followed by VLM-distilled finetuning preserves ranking quality on real long multimodal documents is load-bearing for the MMDocIR results, yet no ablations are described that test transfer to documents containing non-textual visuals (charts, photos, complex layouts) absent from the rendered-text pretraining data.
- [Experiments] Experiments section: aggregate benchmark results are presented without reported statistical significance tests, exact train/validation/test splits for MMDocIR, or implementation details of the early-interaction module and single-pass scorer, which are required to substantiate the latency gains and performance parity with SOTA rerankers.
minor comments (1)
- The abstract states that code is available at https://github.com/dukesun99/ZipRerank; confirming that the repository includes the exact training scripts and model checkpoints used for the reported numbers would aid verification.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of our training strategy and experimental reporting that we will address to strengthen the paper. We respond point-by-point to the major comments below.
Point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (two-stage training): the claim that listwise pretraining on rendered text images followed by VLM-distilled finetuning preserves ranking quality on real long multimodal documents is load-bearing for the MMDocIR results, yet no ablations are described that test transfer to documents containing non-textual visuals (charts, photos, complex layouts) absent from the rendered-text pretraining data.
Authors: We acknowledge the value of explicit ablations to support the transfer claim. The finetuning stage is performed on the MMDocIR training set, which contains real long documents with charts, photos, and complex layouts, allowing adaptation beyond text-only pretraining. To directly test preservation of ranking quality, we will add ablation studies in the revised manuscript comparing models with and without the rendered-text pretraining stage, evaluated on MMDocIR subsets stratified by visual complexity (e.g., text-heavy vs. chart/photo-heavy documents). These results will be reported in §3 and the experiments section. revision: yes
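The stratified evaluation the authors propose amounts to grouping per-query metric values by a visual-complexity label and comparing within-group averages. A minimal sketch, with illustrative labels and numbers rather than MMDocIR results:

```python
from collections import defaultdict
from statistics import mean

def stratified_means(per_query_scores, strata):
    """Group per-query metric values (e.g. NDCG) by a stratum label and
    average within each group. Labels like 'text' vs 'chart' stand in
    for text-heavy vs chart/photo-heavy document subsets."""
    groups = defaultdict(list)
    for score, stratum in zip(per_query_scores, strata):
        groups[stratum].append(score)
    return {s: mean(v) for s, v in groups.items()}

scores = [0.91, 0.88, 0.62, 0.70, 0.95, 0.58]
strata = ["text", "text", "chart", "chart", "text", "chart"]
print(stratified_means(scores, strata))
# e.g. {'text': 0.913..., 'chart': 0.633...}
```

A large gap between strata for the model without rendered-text pretraining, closing when pretraining is added, would directly address the transfer concern.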
-
Referee: [Experiments] Experiments section: aggregate benchmark results are presented without reported statistical significance tests, exact train/validation/test splits for MMDocIR, or implementation details of the early-interaction module and single-pass scorer, which are required to substantiate the latency gains and performance parity with SOTA rerankers.
Authors: We agree these details are essential for reproducibility and validating the claims. In the revised manuscript, we will update the Experiments section to report: (i) statistical significance tests (e.g., paired bootstrap or t-tests with p-values) for all main results against baselines; (ii) the exact train/validation/test splits used from MMDocIR; and (iii) expanded implementation details, including architecture specifications, hyperparameters, and pseudocode for the query-image early interaction module and single-pass scorer (moved to an appendix if needed for space). The open-sourced code already implements these, and we will cross-reference it explicitly. revision: yes
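The paired bootstrap the authors mention resamples queries with replacement and asks how often the challenger fails to beat the baseline. A self-contained sketch on made-up per-query values:

```python
import random

def paired_bootstrap_p(sys_a, sys_b, n_resamples=10000, seed=0):
    """Paired bootstrap significance test on per-query metric values:
    resample query indices with replacement and count how often system
    B's mean gain over system A is non-positive. Returns the observed
    mean gain and a one-sided p-value."""
    rng = random.Random(seed)
    n = len(sys_a)
    observed = sum(b - a for a, b in zip(sys_a, sys_b)) / n
    worse = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(sys_b[i] - sys_a[i] for i in idx) / n
        if delta <= 0:
            worse += 1
    return observed, worse / n_resamples

# Illustrative per-query NDCG for a baseline (A) and a challenger (B):
gain, p = paired_bootstrap_p([0.71, 0.65, 0.80, 0.60], [0.75, 0.70, 0.78, 0.69])
```

With realistic query counts (hundreds rather than four), a small p-value here would substantiate that the reported parity or gains are not resampling noise.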
Circularity Check
No circularity: empirical architecture and benchmark results
Full rationale
The paper describes an engineering proposal (lightweight query-image interaction plus single-pass scoring) trained via a two-stage process (rendered-text pretraining followed by VLM distillation) and validated through direct comparisons on the MMDocIR benchmark. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters, self-citations, or renamed inputs. All load-bearing claims rest on external experimental measurements rather than internal tautologies.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Vision-language models can effectively process rendered text images for pretraining
- domain assumption VLM-teacher soft rankings provide useful supervision for multimodal finetuning
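The second assumption is typically operationalized as a soft listwise distillation loss: minimize the KL divergence between the teacher's and student's score distributions over each candidate list. The sketch below shows that common formulation on plain floats; ZipRerank's exact loss and temperature are not specified here.

```python
import math

def soft_listwise_kl(teacher_scores, student_scores, temperature=1.0):
    """KL divergence between softmax distributions over a candidate list,
    a standard way to distill a teacher's soft ranking into a student."""
    def softmax(xs):
        m = max(xs)  # subtract max for numerical stability
        exps = [math.exp((x - m) / temperature) for x in xs]
        z = sum(exps)
        return [e / z for e in exps]
    p = softmax(teacher_scores)   # teacher's soft ranking
    q = softmax(student_scores)   # student's predicted distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Small loss: the student nearly reproduces the teacher's ordering.
loss = soft_listwise_kl([2.0, 0.5, -1.0], [1.8, 0.6, -0.9])
```

Unlike hard top-1 labels, this supervision rewards matching the teacher's full preference distribution, which is what "soft-ranking supervision" refers to in the abstract.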