ViDoRe benchmark v2: Raising the bar for visual retrieval

Manuel Faysse · 2025 · arXiv 2505.17166

19 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1 · method 1

citation-polarity summary

background 1 · use method 1

representative citing papers

PIXELRAG: Web Screenshots Beat Text for Retrieval-Augmented Generation

cs.IR · 2026-06-01 · unverdicted · novelty 7.0

PixelRAG shows that operating RAG entirely over web screenshots outperforms text-based retrieval on NQ, SimpleQA, MMSearch, LiveVQA, and MoNaCo, with up to 18.1% accuracy gains and 3x token savings via image compression.

Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

cs.IR · 2026-04-25 · unverdicted · novelty 7.0

MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval

cs.CV · 2026-04-11 · unverdicted · novelty 7.0

ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.

LMEB: Long-horizon Memory Embedding Benchmark

cs.CL · 2026-03-13 · unverdicted · novelty 7.0

LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.

Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework

cs.CL · 2026-02-23 · unverdicted · novelty 7.0

Prune-then-Merge combines adaptive pruning of low-signal patches with hierarchical merging to achieve higher compression rates and better performance than prior single-stage methods in visual document retrieval.

Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization

cs.CL · 2025-10-06 · unverdicted · novelty 7.0

GQR is a test-time optimization technique that refines primary retriever query embeddings using complementary retriever scores to achieve high performance with smaller representations in multimodal visual document retrieval.

MM-Matryoshka: Towards Budget-Elastic Visual Document Retrieval via a 2D Multimodal Matryoshka Training Framework

cs.CV · 2026-06-03 · unverdicted · novelty 6.0

MM-Matryoshka is a 2D Matryoshka training framework enabling budget-elastic ColPali-style multi-vector visual document retrieval along dimension and layer without separate models per budget.

MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

MM-BizRAG applies layout-aware document splitting and decoupled multimodal assembly to raise generative recall on enterprise Q&A tasks by up to 32 points over vision-centric baselines while adding FastRAGEval as a cheaper LLM judge.

jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

cs.CL · 2026-05-08 · unverdicted · novelty 6.0 · 3 refs

GELATO extends frozen Jina Embeddings v5 text models with locked non-text encoders, training only connectors to produce competitive multimodal embeddings while preserving exact text performance.

MINER: Mining Multimodal Internal Representation for Efficient Retrieval

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

MINER fuses internal transformer layer representations via probing and adaptive sparse fusion to improve dense single-vector retrieval quality on visual documents by up to 4.5% nDCG@5 while preserving efficiency.

LFRAG: Layout-oriented Fine-grained Retrieval-Augmented Generation on Multimodal Document Understanding

cs.IR · 2026-04-18 · unverdicted · novelty 6.0

LFRAG advances multimodal RAG to block-level retrieval with layout segmentation and cross-attention fusion, reporting SOTA retrieval, 7.20% higher answer accuracy, and 73.07% lower token consumption on the new LFDocQA benchmark.

ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment

cs.IR · 2026-04-08 · unverdicted · novelty 6.0

ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.

MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction

cs.IR · 2025-09-22 · unverdicted · novelty 6.0

MetaEmbed trains fixed learnable Meta Tokens to produce granularity-organized multi-vector embeddings that support test-time scaling in multimodal retrieval.

DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark

cs.CV · 2026-05-28 · unverdicted · novelty 5.0

DocRetriever introduces a framework using layout-aware sparse embeddings for hybrid encoding without OCR and a generalizable reasoning-augmented reranker for few-shot settings, plus the MultiDocR benchmark for evaluation.

Spike Hijacking in Late-Interaction Retrieval

cs.IR · 2026-04-06 · unverdicted · novelty 5.0

Hard maximum similarity pooling in late-interaction models induces higher patch-level gradient concentration and greater length sensitivity than top-k or softmax alternatives.

Attention Grounded Enhancement for Visual Document Retrieval

cs.IR · 2025-11-17 · unverdicted · novelty 5.0

AGREE boosts visual document retrieval by adding local relevance signals from MLLM attention maps to global document labels during retriever training.

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

cs.CV · 2025-07-07 · unverdicted · novelty 5.0

VLM2Vec-V2 is a multimodal embedding model trained on an extended MMEB-V2 benchmark that adds video and visual document tasks and reports gains on both new and prior image benchmarks.

Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

cs.CV · 2026-05-26 · unverdicted · novelty 4.0

A native multimodal embedding model from Gemini achieves reported state-of-the-art results on retrieval benchmarks across modalities via large-scale contrastive learning.

citing papers explorer

Showing 19 of 19 citing papers.

PIXELRAG: Web Screenshots Beat Text for Retrieval-Augmented Generation
cs.IR · 2026-06-01 · unverdicted · none · ref 45

Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval
cs.CV · 2026-05-08 · unverdicted · none · ref 8

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
cs.IR · 2026-04-25 · unverdicted · none · ref 22

Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval
cs.CV · 2026-04-11 · unverdicted · none · ref 24

LMEB: Long-horizon Memory Embedding Benchmark
cs.CL · 2026-03-13 · unverdicted · none · ref 21

Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework
cs.CL · 2026-02-23 · unverdicted · none · ref 6

Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization
cs.CL · 2025-10-06 · unverdicted · none · ref 22

MM-Matryoshka: Towards Budget-Elastic Visual Document Retrieval via a 2D Multimodal Matryoshka Training Framework
cs.CV · 2026-06-03 · unverdicted · none · ref 78

MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A
cs.CL · 2026-06-02 · unverdicted · none · ref 3

jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
cs.CL · 2026-05-08 · unverdicted · none · ref 30 · 3 links

MINER: Mining Multimodal Internal Representation for Efficient Retrieval
cs.LG · 2026-05-07 · unverdicted · none · ref 19

LFRAG: Layout-oriented Fine-grained Retrieval-Augmented Generation on Multimodal Document Understanding
cs.IR · 2026-04-18 · unverdicted · none · ref 22

ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment
cs.IR · 2026-04-08 · unverdicted · none · ref 51

MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
cs.IR · 2025-09-22 · unverdicted · none · ref 44

DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark
cs.CV · 2026-05-28 · unverdicted · none · ref 46

Spike Hijacking in Late-Interaction Retrieval
cs.IR · 2026-04-06 · unverdicted · none · ref 11

Attention Grounded Enhancement for Visual Document Retrieval
cs.IR · 2025-11-17 · unverdicted · none · ref 34

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
cs.CV · 2025-07-07 · unverdicted · none · ref 19

Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini
cs.CV · 2026-05-26 · unverdicted · none · ref 35
