CulMind: Benchmarking Multimodal Understanding and Reasoning in Chinese Cultural Heritage
Pith reviewed 2026-06-26 14:23 UTC · model grok-4.3
The pith
MLLMs exhibit a substantial gap between answer accuracy and reasoning quality on Chinese cultural heritage tasks, with task-adaptive metrics aligning better with expert judgments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CulMind and CulMind-R form a high-quality benchmark for multimodal understanding and reasoning in Chinese cultural heritage, spanning 50 tasks from over 100 museums and a 24-task reasoning subset. ReaScore serves as a task-adaptive metric that evaluates reasoning by automatically weighting task-relevant dimensions. Experiments on 14 leading MLLMs reveal a substantial gap between answers and reasoning, especially on challenging tasks, while task-adaptive dimension selection and weighting better align evaluation results with expert judgments.
What carries the argument
ReaScore, a task-adaptive metric that evaluates reasoning processes by automatically selecting and weighting task-relevant dimensions drawn from visual, textual, stylistic, and historical clues.
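The idea of selecting and weighting task-relevant dimensions can be sketched as a weighted mean. This is a minimal illustration, not the paper's formula: the function name, the task names, the weights, and the convention that a zero or missing weight means "not selected" are all assumptions made here. Only the four clue types (visual, textual, stylistic, historical) come from the text above.

```python
from typing import Dict


def weighted_reasoning_score(dim_scores: Dict[str, float],
                             task_weights: Dict[str, float]) -> float:
    """Weighted mean of per-dimension scores over the dimensions a task selects.

    Hypothetical illustration: a dimension with zero or missing weight is
    treated as not selected for the task and does not affect the score.
    """
    selected = {d: w for d, w in task_weights.items() if w > 0}
    if not selected:
        raise ValueError("task selects no dimensions")
    missing = [d for d in selected if d not in dim_scores]
    if missing:
        raise KeyError(f"missing scores for dimensions: {missing}")
    total_weight = sum(selected.values())
    return sum(dim_scores[d] * w for d, w in selected.items()) / total_weight


# Per-dimension scores in [0, 1] for one model response (made-up values).
scores = {"visual": 1.0, "textual": 0.5, "stylistic": 0.0, "historical": 1.0}

# Two hypothetical tasks that select and weight different dimensions.
dating_task = {"visual": 1.0, "stylistic": 2.0, "historical": 1.0}
inscription_task = {"visual": 1.0, "textual": 3.0}

print(weighted_reasoning_score(scores, dating_task))       # 0.5
print(weighted_reasoning_score(scores, inscription_task))  # 0.625
```

The same response scores differently under the two tasks because each task ignores dimensions it does not select, which is the behavior a task-adaptive metric is meant to have.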
If this is right
- Assessment of MLLMs on cultural heritage should prioritize reasoning-process quality alongside final answers.
- Models require targeted improvements in generating complete reasoning chains that integrate multiple clue types.
- Task-adaptive evaluation methods provide a reference that can transfer to other specialized multimodal domains.
- Public release of the benchmark data and scripts enables direct comparison and extension by other researchers.
- Existing benchmarks that ignore reasoning completeness may produce overoptimistic performance estimates.
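One way to make the answer/reasoning gap in these points concrete is to compare, per task, the fraction of correct final answers with the mean reasoning score. The sketch below assumes reasoning scores are normalized to [0, 1]; the task names and all numbers are invented for illustration and are not results from the paper.

```python
from typing import Dict, List, Tuple

# task name -> list of (answer_correct, reasoning_score in [0, 1])
Results = Dict[str, List[Tuple[bool, float]]]


def answer_reasoning_gaps(results: Results) -> Dict[str, float]:
    """Per-task gap: answer accuracy minus mean reasoning score.

    A large positive gap means the model often reaches the right answer
    while its reasoning is scored as incomplete or inaccurate.
    """
    gaps = {}
    for task, rows in results.items():
        if not rows:
            continue  # skip tasks with no evaluated responses
        accuracy = sum(1 for correct, _ in rows if correct) / len(rows)
        reasoning = sum(score for _, score in rows) / len(rows)
        gaps[task] = accuracy - reasoning
    return gaps


# Toy data with hypothetical task names.
toy_results: Results = {
    "dating": [(True, 0.5), (True, 0.25), (False, 0.25), (True, 0.5)],
    "inscription": [(True, 1.0), (False, 0.5)],
}

for task, gap in answer_reasoning_gaps(toy_results).items():
    print(f"{task}: gap = {gap:+.3f}")
```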
Where Pith is reading between the lines
- The same gap between answers and reasoning may appear in MLLM evaluations for other cultural or historical domains outside China.
- Adopting dimension-weighted scoring could shift benchmark design in multimodal AI toward process-oriented metrics.
- Extending the adaptive weighting approach to new task sets could test its robustness without retraining models.
- Failure modes identified in challenging tasks might guide data collection for improving model reasoning in niche domains.
Load-bearing premise
The selected 50 tasks and 24-task reasoning subset adequately capture fine-grained reasoning over the relevant clues, and ReaScore's adaptive weighting produces scores that match expert judgments.
What would settle it
Collect expert ratings of reasoning quality on the 24-task subset for outputs from the 14 MLLMs, then check whether ReaScore correlates more strongly with those ratings than standard answer-accuracy scores do. If it does not, the central claim is falsified.
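The check described here amounts to comparing two rank correlations against the same expert ratings. A minimal stdlib-only sketch follows, assuming Spearman correlation is an acceptable alignment measure; the choice of Spearman, the variable names, and the toy numbers are assumptions for illustration, not data from the paper.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for five responses; not taken from the paper.
expert_ratings = [1, 2, 3, 4, 5]
reasoning_metric = [0.1, 0.3, 0.2, 0.8, 0.9]
answer_accuracy = [1, 0, 1, 0, 1]

print(spearman(expert_ratings, reasoning_metric))
print(spearman(expert_ratings, answer_accuracy))
```

On real data, the claim would be supported if the first correlation is reliably higher than the second across tasks and models.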
Original abstract
Evaluating Multimodal Large Language Models (MLLMs) in Chinese Cultural Heritage (CCH) requires fine-grained reasoning over visual, textual, stylistic, and historical clues. However, existing CCH benchmarks mainly emphasize final-answer accuracy, while the accuracy and completeness of reasoning processes remain underexplored. To address this gap, we introduce CulMind and CulMind-R: a high-quality benchmark for multimodal CCH covering 50 tasks from collections of more than 100 museums, and a 24-task reasoning subset that adaptively defines task-specific dimensions for reasoning process evaluation. To evaluate reasoning quality, we propose ReaScore, a task-adaptive metric that evaluates reasoning by automatically weighting task-relevant dimensions. Experiments on 14 leading MLLMs reveal a substantial gap between answers and reasoning, especially on challenging tasks. Further analysis shows that task-adaptive dimension selection and weighting better align evaluation results with expert judgments. Overall, our benchmark and metric support a more expert-aligned assessment of CCH understanding and offer a transferable reference for broader evaluations of cultural heritage. We publicly release the data, code, and evaluation scripts at https://github.com/ZevTsao/CulMind to facilitate reproducible research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CulMind, a benchmark with 50 tasks sourced from collections of more than 100 museums for multimodal understanding of Chinese Cultural Heritage (CCH), and CulMind-R, its 24-task reasoning-focused subset that defines task-specific dimensions. It proposes ReaScore, a task-adaptive metric that evaluates reasoning quality by automatically weighting relevant dimensions. Experiments with 14 leading MLLMs demonstrate a substantial gap between final-answer accuracy and reasoning quality (especially on challenging tasks), with analysis showing that the adaptive dimension selection and weighting align evaluation results more closely with expert judgments. The data, code, and evaluation scripts are publicly released.
Significance. If the results hold, the work fills a clear gap in CCH evaluation by shifting focus from answer accuracy alone to fine-grained reasoning over visual, textual, stylistic, and historical clues. The public release of data, code, and evaluation scripts is a notable strength that supports reproducibility and provides a transferable reference for other cultural-heritage benchmarks.
major comments (1)
- [Abstract] Abstract and benchmark-construction description: the claims of a 'substantial gap between answers and reasoning' and 'better align[ment] with expert judgments' rest on the 50-task/24-task split and ReaScore's adaptive weighting, yet no concrete task examples, inter-annotator agreement statistics, dimension definitions, or quantitative alignment metrics (e.g., correlation with experts) are supplied; these details are load-bearing for verifying the central claims.
minor comments (1)
- Clarify the exact procedure for 'automatically weighting task-relevant dimensions' in ReaScore, including any hyperparameters or selection criteria, to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for additional supporting details to substantiate the central claims. We address the comment below and will revise the manuscript accordingly to improve clarity and verifiability.
- Referee: [Abstract] Abstract and benchmark-construction description: the claims of a 'substantial gap between answers and reasoning' and 'better align[ment] with expert judgments' rest on the 50-task/24-task split and ReaScore's adaptive weighting, yet no concrete task examples, inter-annotator agreement statistics, dimension definitions, or quantitative alignment metrics (e.g., correlation with experts) are supplied; these details are load-bearing for verifying the central claims.
Authors: We agree that the abstract and high-level benchmark description would benefit from more explicit supporting details to allow readers to verify the claims without immediately consulting the full sections or supplementary materials. The full manuscript (Section 3 on benchmark construction and Section 4 on ReaScore) does include task examples (e.g., Table 1 and Figure 2), dimension definitions for the 24-task subset, and the rationale for the 50/24 split, along with the public release of data and code at the GitHub repository. However, inter-annotator agreement statistics for the annotation process and quantitative alignment metrics (such as correlation between ReaScore and expert judgments) are computed in our internal analysis but not prominently reported. We will revise the manuscript to: (1) add 2-3 concrete task examples with their reasoning dimensions directly in the main text or a new appendix subsection; (2) report IAA statistics (e.g., Cohen's kappa or percentage agreement) for dimension annotation; (3) include quantitative alignment results (e.g., Pearson/Spearman correlation with expert ratings) in the analysis section; and (4) briefly reference these in the abstract if space permits. These additions will be made in the revised version.
Revision: yes
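For reference, Cohen's kappa, one of the IAA statistics the authors mention, can be computed from two annotators' labels as below. This is a minimal sketch with made-up labels; the annotation scheme shown (one dimension label per item) is an assumption for illustration, not the paper's protocol.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa undefined: both annotators used one label")
    return (observed - expected) / (1 - expected)


# Made-up dimension labels from two hypothetical annotators.
annotator_1 = ["visual", "visual", "historical", "stylistic", "historical", "visual"]
annotator_2 = ["visual", "historical", "historical", "stylistic", "historical", "visual"]

print(round(cohens_kappa(annotator_1, annotator_2), 4))  # 0.7391
```

Kappa corrects raw percentage agreement (here 5 of 6 items) for the agreement two annotators would reach by chance, which is why it is usually reported alongside or instead of percentage agreement.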
Circularity Check
No significant circularity detected
The paper introduces CulMind, CulMind-R, and ReaScore as new benchmark and metric contributions without any equations, derivations, or load-bearing self-citations. The 50-task/24-task construction and task-adaptive weighting are presented as independent design choices for expert alignment, with no reduction of predictions or results to fitted inputs or self-referential definitions. As is typical for a benchmark paper, the argument is self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Existing CCH benchmarks mainly emphasize final-answer accuracy, while the accuracy and completeness of reasoning processes remain underexplored.