ChinaHeritaQA is a new bilingual VQA benchmark dataset with 2,279 images and 14,133 QA pairs for evaluating cultural reasoning abilities of VLMs on Chinese World Heritage sites across seven cognitive dimensions.
Title resolution pending
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 8verdicts
UNVERDICTED 8roles
background 1polarities
background 1representative citing papers
BICR trains a lightweight probe on contrastive hidden states from real versus blind images to detect visual grounding in LVLM predictions, outperforming baselines on calibration and discrimination with fewer parameters.
MemQ improves LLM agent performance by using eligibility traces over provenance DAGs to assign credit to dependent memories, achieving top success rates on six benchmarks with largest gains on complex multi-step tasks.
SASAV introduces the first fully autonomous multi-agent system for scientific data analysis and visualization that operates without external prompting or human-in-the-loop feedback.
Introduces CulMind benchmark, CulMind-R reasoning subset, and ReaScore metric to evaluate MLLMs on Chinese cultural heritage multimodal understanding and reasoning quality.
A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.
RAVE is a lightweight pair-gating addition to self-attention that improves visual token allocation in LMMs and delivers an average 3-point gain on multimodal benchmarks, largest on perception-heavy tasks.
citing papers explorer
-
Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking
BICR trains a lightweight probe on contrastive hidden states from real versus blind images to detect visual grounding in LVLM predictions, outperforming baselines on calibration and discrimination with fewer parameters.