arXiv preprint arXiv:2502.09927

Team, G · 2025 · arXiv 2502.09927

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

baseline 2

citation-polarity summary

baseline 2

representative citing papers

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

cs.CV · 2026-05-20 · conditional · novelty 7.0

WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.

Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.

ParseBench: A Document Parsing Benchmark for AI Agents

cs.CV · 2026-04-09 · accept · novelty 7.0

ParseBench is a new benchmark for document parsing in AI agents. It reveals fragmented performance across five semantic dimensions, with LlamaParse Agentic scoring highest at 84.9%.

ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding

cs.CV · 2026-03-28 · unverdicted · novelty 7.0

ChartNet is a million-scale multimodal dataset for chart understanding created via code-guided synthesis spanning 24 chart types with five aligned modalities per sample.

POTATR: A Lightweight Image-to-Graph Model for Page-Level Table Extraction

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

POTATR extends TATR into a 29M-parameter image-to-graph model for contextual page-level table extraction, reporting 0.964 GriTS_Con on PubTables-v2 Single Pages while running 130x faster and 300x cheaper than tested alternatives including MLLMs.

Building a Precise Video Language with Human-AI Oversight

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

The CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones such as Gemini-3.1-Pro and improving fine-grained control in video generation models.

citing papers explorer

Showing 4 of 4 citing papers after filters.

Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval

cs.CV · 2026-05-08 · unverdicted · none · ref 36

ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding

cs.CV · 2026-03-28 · unverdicted · none · ref 51

POTATR: A Lightweight Image-to-Graph Model for Page-Level Table Extraction

cs.CV · 2026-06-08 · unverdicted · none · ref 26

Building a Precise Video Language with Human-AI Oversight

cs.CV · 2026-04-22 · unverdicted · none · ref 59
