ColPali: Efficient Document Retrieval with Vision Language Models

Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo · 2024 · arXiv 2407.01449

18 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering

cs.AI · 2026-04-30 · unverdicted · novelty 7.0

MED-VRAG reaches 78.6% average accuracy on four medical QA benchmarks by iteratively retrieving PMC page images with ColQwen2.5 embeddings and a VLM that refines queries over up to three rounds.

Bottleneck Tokens for Unified Multimodal Retrieval

cs.LG · 2026-04-13 · unverdicted · novelty 7.0

Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.

Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval

cs.CV · 2026-04-11 · unverdicted · novelty 7.0

ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.

VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

VISOR is a unified agentic VRAG framework with Evidence Space structuring, visual action evaluation/correction, and dynamic sliding-window trajectories trained via GRPO-based RL that achieves SOTA performance on long-horizon visual reasoning benchmarks.

MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL

cs.IR · 2026-04-08 · unverdicted · novelty 7.0

MARVEL reaches 37.9 nDCG@10 on the MM-BRIGHT benchmark by combining LLM query expansion, a reasoning-enhanced dense retriever, and GPT-4o CoT reranking, beating prior multimodal encoders by 10.3 points.

PLUME: Latent Reasoning Based Universal Multimodal Embedding

cs.CV · 2026-04-02 · unverdicted · novelty 7.0

PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.

Retrieval from Within: An Intrinsic Capability of Attention-Based Models

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.

Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings

cs.CV · 2026-04-24 · unverdicted · novelty 6.0

Rewrite-driven generation with alignment and RL produces shorter, more effective generative multimodal embeddings than CoT methods on retrieval benchmarks.

Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA

cs.CL · 2026-04-15 · unverdicted · novelty 6.0

Doc-V* proposes a coarse-to-fine interactive visual reasoning agent for multi-page document VQA that aggregates evidence selectively via semantic retrieval and targeted fetching, outperforming baselines by up to 47.9% on out-of-domain tasks.

SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

cs.CV · 2026-04-15 · conditional · novelty 6.0 · 2 refs

SLQ adapts frozen MLLMs for multimodal retrieval by appending shared latent queries to text and image tokens and introduces KARR-Bench to test knowledge-aware reasoning retrieval.

HIVE: Query, Hypothesize, Verify: An LLM Framework for Multimodal Reasoning-Intensive Retrieval

cs.IR · 2026-04-08 · unverdicted · novelty 6.0

HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and +14.1 over the best multimodal baseline.

SmolVLM: Redefining small and efficient multimodal models

cs.AI · 2025-04-07 · unverdicted · novelty 6.0

SmolVLM-256M outperforms a model 300 times larger while using under 1 GB of GPU memory, and the 2.2B version matches state-of-the-art VLMs at half the memory cost.

MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph

cs.CV · 2026-05-11 · unverdicted · novelty 5.0

MicroWorld constructs a multimodal attributed property graph from scientific image-caption data and augments MLLM prompts via retrieval to raise Qwen3-VL-8B performance by 37.5% on MicroVQA and 6% on MicroBench.

LensVLM: Selective Context Expansion for Compressed Visual Representation of Text

cs.CV · 2026-05-07 · unverdicted · novelty 5.0

LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on text QA benchmarks.

AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce

cs.CL · 2026-04-22 · unverdicted · novelty 5.0

AFMRL uses MLLM-generated attributes in attribute-guided contrastive learning and retrieval-aware reinforcement to achieve SOTA fine-grained multimodal retrieval on e-commerce datasets.

DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

cs.AI · 2026-04-14 · unverdicted · novelty 5.0 · 2 refs

DocSeeker improves long-document understanding in MLLMs via a two-stage training process that combines supervised fine-tuning from distilled data with evidence-aware group relative policy optimization and memory-efficient resolution allocation.

BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

cs.IR · 2026-04-08 · unverdicted · novelty 5.0

BRIDGE reaches 29.7 nDCG@10 on MM-BRIGHT by RL-aligning multimodal queries to text and using a reasoning retriever, beating multimodal encoders and, when combined with Nomic-Vision, exceeding the best text-only retriever at 33.3.
