WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.
arXiv preprint arXiv:2502.09927 , year=
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 6years
2026 6roles
baseline 2polarities
baseline 2representative citing papers
A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.
ParseBench is a new benchmark for document parsing in AI agents that reveals fragmented performance across five semantic dimensions with LlamaParse Agentic scoring highest at 84.9%.
ChartNet is a million-scale multimodal dataset for chart understanding created via code-guided synthesis spanning 24 chart types with five aligned modalities per sample.
POTATR extends TATR into a 29M-parameter image-to-graph model for contextual page-level table extraction, reporting 0.964 GriTS_Con on PubTables-v2 Single Pages while running 130x faster and 300x cheaper than tested alternatives including MLLMs.
CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.
citing papers explorer
-
Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval
A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.
-
ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
ChartNet is a million-scale multimodal dataset for chart understanding created via code-guided synthesis spanning 24 chart types with five aligned modalities per sample.
-
POTATR: A Lightweight Image-to-Graph Model for Page-Level Table Extraction
POTATR extends TATR into a 29M-parameter image-to-graph model for contextual page-level table extraction, reporting 0.964 GriTS_Con on PubTables-v2 Single Pages while running 130x faster and 300x cheaper than tested alternatives including MLLMs.
-
Building a Precise Video Language with Human-AI Oversight
CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.