MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
8 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (8)
roles: background (1)
polarities: background (1)
representative citing papers
BICR trains a lightweight probe on contrastive hidden states from real versus blind images to detect visual grounding in LVLM predictions, outperforming baselines on calibration and discrimination with fewer parameters.
MemQ improves LLM agent performance by using eligibility traces over provenance DAGs to assign credit to dependent memories, achieving top success rates on six benchmarks with largest gains on complex multi-step tasks.
SASAV introduces the first fully autonomous multi-agent system for scientific data analysis and visualization that operates without external prompting or human-in-the-loop feedback.
Introduces CulMind benchmark, CulMind-R reasoning subset, and ReaScore metric to evaluate MLLMs on Chinese cultural heritage multimodal understanding and reasoning quality.
A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.
citing papers explorer
- ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China
  ChinaHeritaQA is a new bilingual VQA benchmark dataset with 2,279 images and 14,133 QA pairs for evaluating cultural reasoning abilities of VLMs on Chinese World Heritage sites across seven cognitive dimensions.
- Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking
  BICR trains a lightweight probe on contrastive hidden states from real versus blind images to detect visual grounding in LVLM predictions, outperforming baselines on calibration and discrimination with fewer parameters.
- MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs
  MemQ improves LLM agent performance by using eligibility traces over provenance DAGs to assign credit to dependent memories, achieving top success rates on six benchmarks with largest gains on complex multi-step tasks.
- SASAV: Self-Directed Agent for Scientific Analysis and Visualization
  SASAV introduces the first fully autonomous multi-agent system for scientific data analysis and visualization that operates without external prompting or human-in-the-loop feedback.
- CulMind: Benchmarking Multimodal Understanding and Reasoning in Chinese Cultural Heritage
  Introduces CulMind benchmark, CulMind-R reasoning subset, and ReaScore metric to evaluate MLLMs on Chinese cultural heritage multimodal understanding and reasoning quality.
- Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
  A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
- AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis
  AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.
- RAVE: Re-Allocating Visual Attention in Large Multimodal Models