ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval
19 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- PIXELRAG: Web Screenshots Beat Text for Retrieval-Augmented Generation
  PixelRAG shows that operating RAG entirely over web screenshots outperforms text-based retrieval on NQ, SimpleQA, MMSearch, LiveVQA, and MoNaCo, with up to 18.1% accuracy gains and 3x token savings via image compression.
- Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval
  A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.
- MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
  MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
- Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval
  ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.
- LMEB: Long-horizon Memory Embedding Benchmark
  LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.
- Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework
  Prune-then-Merge combines adaptive pruning of low-signal patches with hierarchical merging to achieve higher compression rates and better performance than prior single-stage methods in visual document retrieval.
- Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization
  GQR is a test-time optimization technique that refines primary retriever query embeddings using complementary retriever scores to achieve high performance with smaller representations in multimodal visual document retrieval.
- MM-Matryoshka: Towards Budget-Elastic Visual Document Retrieval via a 2D Multimodal Matryoshka Training Framework
  MM-Matryoshka is a 2D Matryoshka training framework enabling budget-elastic ColPali-style multi-vector visual document retrieval along dimension and layer without separate models per budget.
- MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A
  MM-BizRAG applies layout-aware document splitting and decoupled multimodal assembly to raise generative recall on enterprise Q&A tasks by up to 32 points over vision-centric baselines while adding FastRAGEval as a cheaper LLM judge.
- jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
  GELATO extends frozen Jina Embeddings v5 text models with locked non-text encoders, training only connectors to produce competitive multimodal embeddings while preserving exact text performance.
- MINER: Mining Multimodal Internal Representation for Efficient Retrieval
  MINER fuses internal transformer layer representations via probing and adaptive sparse fusion to improve dense single-vector retrieval quality on visual documents by up to 4.5% nDCG@5 while preserving efficiency.
- LFRAG: Layout-oriented Fine-grained Retrieval-Augmented Generation on Multimodal Document Understanding
  LFRAG advances multimodal RAG to block-level retrieval with layout segmentation and cross-attention fusion, reporting SOTA retrieval, 7.20% higher answer accuracy, and 73.07% lower token consumption on the new LFDocQA benchmark.
- ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment
  ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.
- MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
  MetaEmbed trains fixed learnable Meta Tokens to produce granularity-organized multi-vector embeddings that support test-time scaling in multimodal retrieval.
- DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark
  DocRetriever introduces a framework using layout-aware sparse embeddings for hybrid encoding without OCR and a generalizable reasoning-augmented reranker for few-shot settings, plus the MultiDocR benchmark for evaluation.
- Spike Hijacking in Late-Interaction Retrieval
  Hard maximum similarity pooling in late-interaction models induces higher patch-level gradient concentration and greater length sensitivity than top-k or softmax alternatives.
- Attention Grounded Enhancement for Visual Document Retrieval
  AGREE boosts visual document retrieval by adding local relevance signals from MLLM attention maps to global document labels during retriever training.
- VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
  VLM2Vec-V2 is a multimodal embedding model trained on an extended MMEB-V2 benchmark that adds video and visual document tasks and reports gains on both new and prior image benchmarks.
- Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini
  A native multimodal embedding model from Gemini achieves reported state-of-the-art results on retrieval benchmarks across modalities via large-scale contrastive learning.