
arxiv: 2606.21618 · v1 · pith:MN4IUF7Rnew · submitted 2026-06-19 · 💻 cs.CL

CulMind: Benchmarking Multimodal Understanding and Reasoning in Chinese Cultural Heritage

Pith reviewed 2026-06-26 14:23 UTC · model grok-4.3

classification 💻 cs.CL
keywords Chinese cultural heritage · multimodal large language models · reasoning evaluation · benchmark · ReaScore · multimodal understanding · cultural heritage AI · reasoning processes

The pith

MLLMs exhibit a substantial gap between answer accuracy and reasoning quality on Chinese cultural heritage tasks, with task-adaptive metrics aligning better with expert judgments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CulMind, a benchmark of 50 tasks drawn from collections in more than 100 museums, along with CulMind-R, a 24-task subset focused on reasoning processes in multimodal large language models applied to Chinese cultural heritage. It proposes ReaScore, a metric that evaluates reasoning quality through automatic selection and weighting of task-relevant dimensions rather than relying solely on final-answer accuracy. Experiments across 14 leading models demonstrate that correct answers frequently accompany incomplete or flawed reasoning, with the discrepancy most pronounced on challenging tasks. A sympathetic reader would care because current evaluation practices risk overstating model competence in domains requiring integration of visual, textual, stylistic, and historical information. The benchmark and metric aim to enable assessments that more closely match expert human judgment.

Core claim

CulMind and CulMind-R form a high-quality benchmark for multimodal understanding and reasoning in Chinese cultural heritage, spanning 50 tasks from over 100 museums and a 24-task reasoning subset. ReaScore serves as a task-adaptive metric that evaluates reasoning by automatically weighting task-relevant dimensions. Experiments on 14 leading MLLMs reveal a substantial gap between answers and reasoning, especially on challenging tasks, while task-adaptive dimension selection and weighting better align evaluation results with expert judgments.
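The gap between answers and reasoning can be made concrete with a small calculation. The sketch below is illustrative only: the `Sample` record, the 0-to-1 reasoning score, the 0.5 threshold, and the toy values are all assumptions introduced here, not quantities or results from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One model response (hypothetical record layout)."""
    correct: bool      # whether the final answer was judged correct
    reasoning: float   # reasoning-quality score, assumed to lie in [0, 1]


def answer_reasoning_gap(samples: List[Sample]) -> float:
    """Answer accuracy minus mean reasoning score over the same samples."""
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)
    accuracy = sum(s.correct for s in samples) / n
    mean_reasoning = sum(s.reasoning for s in samples) / n
    return accuracy - mean_reasoning


def flawed_but_correct_rate(samples: List[Sample], threshold: float = 0.5) -> float:
    """Share of correct answers whose reasoning score is below `threshold`.

    The threshold is an arbitrary illustrative cut-off.
    """
    correct = [s for s in samples if s.correct]
    if not correct:
        return 0.0
    return sum(s.reasoning < threshold for s in correct) / len(correct)


# Toy inputs, invented for illustration.
toy = [Sample(True, 0.9), Sample(True, 0.3), Sample(False, 0.2), Sample(True, 0.4)]
print(answer_reasoning_gap(toy))
print(flawed_but_correct_rate(toy))
```

A positive gap means accuracy overstates reasoning quality; the second function isolates the case the summary highlights, where the answer is right but the reasoning behind it is weak.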

What carries the argument

ReaScore, a task-adaptive metric that evaluates reasoning processes by automatically selecting and weighting task-relevant dimensions drawn from visual, textual, stylistic, and historical clues.
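The page does not give ReaScore's formula, so the following is only a minimal sketch of the general idea of dimension selection and weighting: a weighted mean over the dimensions a task selects. The function name, the dimension keys, and the weights are hypothetical; how the paper actually chooses and weights dimensions is not specified here.

```python
from typing import Dict


def weighted_reasoning_score(dim_scores: Dict[str, float],
                             dim_weights: Dict[str, float]) -> float:
    """Weighted mean of per-dimension scores.

    A dimension counts as "selected" for a task when its weight is positive.
    This is an illustrative stand-in, not the paper's ReaScore definition.
    """
    selected = {d: w for d, w in dim_weights.items() if w > 0 and d in dim_scores}
    total_weight = sum(selected.values())
    if total_weight == 0:
        raise ValueError("no dimension selected for this task")
    return sum(dim_scores[d] * w for d, w in selected.items()) / total_weight


# Hypothetical per-dimension scores for one response, each in [0, 1].
scores = {"visual": 0.8, "textual": 0.6, "stylistic": 0.4, "historical": 0.5}
# Hypothetical task-specific weights; a zero weight drops the dimension.
weights = {"visual": 2.0, "textual": 1.0, "stylistic": 1.0, "historical": 0.0}
print(weighted_reasoning_score(scores, weights))
```

Normalizing by the total selected weight keeps scores comparable across tasks that select different numbers of dimensions.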

If this is right

  • Assessment of MLLMs on cultural heritage should prioritize reasoning-process quality alongside final answers.
  • Models require targeted improvements in generating complete reasoning chains that integrate multiple clue types.
  • Task-adaptive evaluation methods provide a reference that can transfer to other specialized multimodal domains.
  • Public release of the benchmark data and scripts enables direct comparison and extension by other researchers.
  • Existing benchmarks that ignore reasoning completeness may produce overoptimistic performance estimates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gap between answers and reasoning may appear in MLLM evaluations for other cultural or historical domains outside China.
  • Adopting dimension-weighted scoring could shift benchmark design in multimodal AI toward process-oriented metrics.
  • Extending the adaptive weighting approach to new task sets could test its robustness without retraining models.
  • Failure modes identified in challenging tasks might guide data collection for improving model reasoning in niche domains.

Load-bearing premise

The selected 50 tasks and 24-task reasoning subset adequately capture fine-grained reasoning over the relevant clues, and ReaScore's adaptive weighting produces scores that match expert judgments.

What would settle it

Collect expert ratings of reasoning quality on the 24-task subset for outputs from the 14 MLLMs, then check whether ReaScore correlates more strongly with those ratings than answer accuracy does; if it does not, the central claim is falsified.
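One way to run such a check is a rank correlation between each metric and the expert ratings. The sketch below implements Spearman's rho in plain Python (average ranks for ties, then Pearson on the ranks). The choice of Spearman and all input values are assumptions made for illustration; none of the numbers come from the paper.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy inputs, invented for illustration.
expert = [1, 2, 3, 4, 5]              # expert reasoning ratings
metric = [0.2, 0.3, 0.5, 0.6, 0.9]    # a reasoning metric's scores
accuracy = [1, 1, 0, 1, 1]            # binary answer correctness
print(spearman(expert, metric), spearman(expert, accuracy))
```

The comparison of interest is whether the first correlation exceeds the second on real expert ratings; binary accuracy produces many ties, which is why tie-aware ranking is used.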

Figures

Figures reproduced from arXiv: 2606.21618 by Jiajun Zhang, Liangbin Yang, Qi Meng, Shuhan Fan, Yangfu Zhu, Yihang Peng, Yuting Wei, Zhangwei Cao.

Figure 1: Overview of CULMIND and CULMIND-R. CULMIND covers 50 fine-grained tasks across seven CCH subdomains, while the brown-highlighted region denotes CULMIND-R, the 24-task reasoning-process subset. The seven CCH subdomains are illustrated in …
Figure 2: The seven CCH subdomains covered by CULMIND. Detailed examples of the 50 tasks in CULMIND are presented in Figures 3–16. Example (Ancient Books) Q: 将图像中的古文翻译成符合现代表达规范的现代汉语,要求:直接输出翻译后的文本,不要包含无关内容。 (Translate the classical Chinese text in the image into modern Chinese that conforms to contemporary expression standards. Requirements: Output only the translated text directly; do not include any irrelevant cont…)
Figure 3: CULMIND examples for Tasks 1–4
Figure 4: CULMIND examples for Tasks 5–7
Figure 5: CULMIND examples for Tasks 8–11
Figure 6: C…
Figure 7: CULMIND examples for Tasks 16–19
Figure 8: CULMIND examples for Tasks 20–23
Figure 9: CULMIND examples for Tasks 24–27
Figure 10: CULMIND examples for Tasks 28–29
Figure 11: CULMIND examples for Tasks 30–32
Figure 12: CULMIND examples for Tasks 33–36
Figure 13: CULMIND examples for Tasks 37–40
Figure 14: CULMIND examples for Tasks 41–44
Figure 15: CULMIND examples for Tasks 45–48
Figure 16: CULMIND examples for Tasks 49–50

Original abstract

Evaluating Multimodal Large Language Models (MLLMs) in Chinese Cultural Heritage (CCH) requires fine-grained reasoning over visual, textual, stylistic, and historical clues. However, existing CCH benchmarks mainly emphasize final-answer accuracy, while the accuracy and completeness of reasoning processes remain underexplored. To address this gap, we introduce CulMind and CulMind-R: a high-quality benchmark for multimodal CCH covering 50 tasks from collections of more than 100 museums, and a 24-task reasoning subset that adaptively defines task-specific dimensions for reasoning process evaluation. To evaluate reasoning quality, we propose ReaScore, a task-adaptive metric that evaluates reasoning by automatically weighting task-relevant dimensions. Experiments on 14 leading MLLMs reveal a substantial gap between answers and reasoning, especially on challenging tasks. Further analysis shows that task-adaptive dimension selection and weighting better align evaluation results with expert judgments. Overall, our benchmark and metric support a more expert-aligned assessment of CCH understanding and offer a transferable reference for broader evaluations of cultural heritage. We publicly release the data, code, and evaluation scripts at https://github.com/ZevTsao/CulMind to facilitate reproducible research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces CulMind, a benchmark with 50 tasks sourced from collections of more than 100 museums for multimodal understanding of Chinese Cultural Heritage (CCH), and CulMind-R, its 24-task reasoning-focused subset that defines task-specific dimensions. It proposes ReaScore, a task-adaptive metric that evaluates reasoning quality by automatically weighting relevant dimensions. Experiments with 14 leading MLLMs demonstrate a substantial gap between final-answer accuracy and reasoning quality (especially on challenging tasks), with analysis showing that the adaptive dimension selection and weighting align evaluation results more closely with expert judgments. The data, code, and evaluation scripts are publicly released.

Significance. If the results hold, the work fills a clear gap in CCH evaluation by shifting focus from answer accuracy alone to fine-grained reasoning over visual, textual, stylistic, and historical clues. The public release of data, code, and evaluation scripts is a notable strength that supports reproducibility and provides a transferable reference for other cultural-heritage benchmarks.

major comments (1)
  1. [Abstract] Abstract and benchmark-construction description: the claims of a 'substantial gap between answers and reasoning' and 'better align[ment] with expert judgments' rest on the 50-task/24-task split and ReaScore's adaptive weighting, yet no concrete task examples, inter-annotator agreement statistics, dimension definitions, or quantitative alignment metrics (e.g., correlation with experts) are supplied; these details are load-bearing for verifying the central claims.
minor comments (1)
  1. Clarify the exact procedure for 'automatically weighting task-relevant dimensions' in ReaScore, including any hyperparameters or selection criteria, to improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for additional supporting details to substantiate the central claims. We address the comment below and will revise the manuscript accordingly to improve clarity and verifiability.

  1. Referee: [Abstract] Abstract and benchmark-construction description: the claims of a 'substantial gap between answers and reasoning' and 'better align[ment] with expert judgments' rest on the 50-task/24-task split and ReaScore's adaptive weighting, yet no concrete task examples, inter-annotator agreement statistics, dimension definitions, or quantitative alignment metrics (e.g., correlation with experts) are supplied; these details are load-bearing for verifying the central claims.

    Authors: We agree that the abstract and high-level benchmark description would benefit from more explicit supporting details to allow readers to verify the claims without immediately consulting the full sections or supplementary materials. The full manuscript (Section 3 on benchmark construction and Section 4 on ReaScore) does include task examples (e.g., Table 1 and Figure 2), dimension definitions for the 24-task subset, and the rationale for the 50/24 split, along with the public release of data and code at the GitHub repository. However, inter-annotator agreement statistics for the annotation process and quantitative alignment metrics (such as correlation between ReaScore and expert judgments) are computed in our internal analysis but not prominently reported. We will revise the manuscript to: (1) add 2–3 concrete task examples with their reasoning dimensions directly in the main text or a new appendix subsection; (2) report IAA statistics (e.g., Cohen's kappa or percentage agreement) for dimension annotation; (3) include quantitative alignment results (e.g., Pearson/Spearman correlation with expert ratings) in the analysis section; and (4) briefly reference these in the abstract if space permits. These additions will be made in the revised version.

    revision: yes
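For readers unfamiliar with the agreement statistic named in the rebuttal, the sketch below computes Cohen's kappa for two annotators labeling the same items. The label names and the toy annotations are invented for illustration; the paper's annotation scheme and any actual agreement figures are not given on this page.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators.

    p_o is the observed agreement rate; p_e is the agreement expected by
    chance from each annotator's own label frequencies.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Toy annotations, invented for illustration: the dimension each annotator
# judged most relevant for four items.
annotator_1 = ["visual", "visual", "textual", "textual"]
annotator_2 = ["visual", "textual", "textual", "textual"]
print(cohens_kappa(annotator_1, annotator_2))
```

Unlike raw percentage agreement, kappa discounts the agreement two annotators would reach by chance, which matters when one label dominates.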

Circularity Check

0 steps flagged

No significant circularity detected


The paper introduces CulMind, CulMind-R, and ReaScore as new benchmark and metric contributions without any equations, derivations, or load-bearing self-citations. The 50-task/24-task construction and task-adaptive weighting are presented as independent design choices for expert alignment, with no reduction of predictions or results to fitted inputs or self-referential definitions. The derivation chain is self-contained as a standard benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Benchmark paper with no free parameters, invented entities, or non-standard axioms beyond the domain assumption that current CCH evaluations undervalue reasoning processes.

axioms (1)
  • domain assumption: Existing CCH benchmarks mainly emphasize final-answer accuracy while the accuracy and completeness of reasoning processes remain underexplored.
    Stated directly in the abstract as the motivation for the new benchmark.


