A new cross-cultural benchmark shows vision-language models infer structured cultural metadata from images inconsistently, with fragmented signals and large performance gaps across regions and metadata types.
arXiv preprint arXiv:2507.21917 (2025)
3 Pith papers cite this work. Polarity classification is still being indexed.
Citing papers by year: 2026 (3)
Verdicts: unverdicted (3)
citing papers explorer
- Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
  A new cross-cultural benchmark shows vision-language models infer structured cultural metadata from images inconsistently, with fragmented signals and large performance gaps across regions and metadata types.
- Understanding How MLLMs Describe Artworks Using Token Activation Maps
  Token Activation Maps applied to MLLM art descriptions reveal that visual grounding strength varies by token category, with better artist identification than title prediction.
- Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding
  MLLMs exhibit a consistent recognition-reasoning inversion on discrete visual symbols across domains, underperforming on elementary perception while appearing competent on higher-level reasoning via linguistic compensation.