Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination
Pith reviewed 2026-05-08 03:31 UTC · model grok-4.3
The pith
Even state-of-the-art LLMs struggle with complex historical research questions requiring evidentiary reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ProHist-Bench consists of 400 expert-curated questions spanning eight dynasties with 10,891 rubrics for scoring. When 18 LLMs were tested, even the best models performed poorly on tasks requiring deep historical analysis and evidentiary support, revealing a significant proficiency gap in professional-level historical reasoning.
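The scoring setup described above — hundreds of questions, each carrying many fine-grained rubric items — can be made concrete with a minimal sketch. The class names, weights, and example rubric below are illustrative assumptions, not taken from the paper; the paper only states that 10,891 rubrics score the 400 questions.

```python
# Hypothetical sketch of rubric-based scoring in the style described for
# ProHist-Bench: each question carries fine-grained rubric items, a judge
# marks each item satisfied or not, and the answer's score is the weighted
# fraction of items satisfied. All names and weights here are illustrative.
from dataclasses import dataclass, field


@dataclass
class RubricItem:
    description: str          # e.g. "cites a primary source for the claim"
    weight: float = 1.0       # relative importance of this criterion


@dataclass
class Question:
    prompt: str
    rubric: list[RubricItem] = field(default_factory=list)


def score_answer(question: Question, satisfied: list[bool]) -> float:
    """Weighted fraction of rubric items the answer satisfies, in [0, 1]."""
    assert len(satisfied) == len(question.rubric)
    total = sum(item.weight for item in question.rubric)
    earned = sum(item.weight for item, ok in zip(question.rubric, satisfied) if ok)
    return earned / total if total else 0.0


q = Question(
    prompt="Assess the Song-dynasty reforms to the Keju curriculum.",
    rubric=[
        RubricItem("names the relevant reformers and dates", 1.0),
        RubricItem("cites evidentiary support from period sources", 2.0),
        RubricItem("maintains logical coherence across the argument", 1.0),
    ],
)
print(score_answer(q, [True, False, True]))  # 0.5
```

Weighting the evidentiary item more heavily, as in this sketch, is one way such a rubric could privilege evidentiary reasoning over recall; whether the paper's rubrics are weighted at all is not stated in the material above.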
What carries the argument
ProHist-Bench, a benchmark of 400 challenging questions and 10,891 fine-grained evaluation rubrics derived from the Chinese Imperial Examination (Keju) system.
Load-bearing premise
The 400 questions and 10,891 rubrics accurately capture the higher-order evidentiary reasoning skills central to professional historical research.
What would settle it
An independent panel of historians answering the 400 questions and receiving rubric scores comparable to those of top LLMs would contradict the claimed proficiency gap.
Original abstract
While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills, such as evidentiary reasoning, that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant proficiency gap: even state-of-the-art LLMs struggle with complex historical research questions. We hope ProHist-Bench will facilitate the development of domain-specific reasoning LLMs, advance computational historical research, and further uncover the untapped potential of LLMs. We release ProHist-Bench at https://github.com/inclusionAI/ABench/tree/main/ProHist-Bench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ProHist-Bench, a benchmark of 400 expert-curated questions drawn from the Chinese Imperial Examination (Keju) system across eight dynasties, paired with 10,891 fine-grained rubrics developed via interdisciplinary collaboration. It evaluates 18 LLMs on this benchmark and concludes that even state-of-the-art models exhibit a significant proficiency gap on complex historical research questions involving higher-order skills such as evidentiary reasoning. The benchmark is released publicly to support further work in domain-specific reasoning and computational history.
Significance. If the benchmark is shown to be a valid proxy for professional historical research capabilities, the work would usefully document current LLM limitations on structured historical reasoning tasks and supply a large-scale, rubric-annotated resource for model development. The public release of the dataset and the scale of the rubric set (10,891 items) constitute concrete strengths that could enable reproducible follow-up studies.
Major comments (1)
- The central claim that LLMs 'struggle with complex historical research questions' (abstract) depends on ProHist-Bench faithfully measuring higher-order evidentiary reasoning rather than exam-format recall and structured argumentation. The manuscript provides no correlation study, inter-rater validation against practicing historians, or comparison to open-ended research tasks (e.g., primary-source triangulation), leaving the proxy relationship unverified and the broader conclusion at risk.
Minor comments (1)
- The abstract and introduction would benefit from an explicit statement of how the 400 questions were sampled across dynasties and question types; a summary table would improve transparency.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address the central concern regarding the validity of ProHist-Bench as a proxy for higher-order historical reasoning below, providing the strongest honest defense grounded in the manuscript's design and limitations.
Point-by-point responses
Referee: The central claim that LLMs 'struggle with complex historical research questions' (abstract) depends on ProHist-Bench faithfully measuring higher-order evidentiary reasoning rather than exam-format recall and structured argumentation. The manuscript provides no correlation study, inter-rater validation against practicing historians, or comparison to open-ended research tasks (e.g., primary-source triangulation), leaving the proxy relationship unverified and the broader conclusion at risk.
Authors: We appreciate the referee's emphasis on rigorous validation of the proxy relationship. ProHist-Bench is explicitly anchored in the Keju system, which for over a millennium served as the primary mechanism for assessing precisely the higher-order skills of evidentiary reasoning, source synthesis, and structured argumentation rather than rote recall; the 400 questions were selected and phrased by historians to require these competencies, and the 10,891 rubrics were co-developed through interdisciplinary collaboration to score granular elements such as evidence citation, logical coherence, and contextual integration. This historical grounding and expert curation provide substantive support for the benchmark's alignment with professional historical research capabilities. That said, we acknowledge that the manuscript does not include a post-hoc correlation analysis with practicing historians' judgments on open-ended tasks or direct comparisons to primary-source triangulation exercises. We will revise the manuscript to expand the discussion of benchmark construction, explicitly articulate the rationale for using Keju as a proxy, and add a dedicated limitations subsection that notes the absence of such external validation studies while outlining directions for future work to address this gap.
Revision: partial
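The rebuttal describes per-question rubric scores being compared across 18 models to establish the proficiency gap. A minimal sketch of that aggregation step, under the assumption that each model receives a score in [0, 1] per question and models are ranked by their macro-average (the model names and numbers below are hypothetical):

```python
# Illustrative sketch (not from the paper) of turning per-question rubric
# scores into per-model benchmark scores and a ranking. Assumes each score
# is already normalized to [0, 1]; model names and values are hypothetical.
from statistics import mean


def benchmark_score(per_question_scores: list[float]) -> float:
    """Macro-average of per-question rubric scores, each in [0, 1]."""
    return mean(per_question_scores)


results = {
    "model-a": [0.62, 0.41, 0.55],  # hypothetical scores on three questions
    "model-b": [0.48, 0.39, 0.44],
}

# Rank models by macro-averaged score, best first.
leaderboard = sorted(results, key=lambda m: benchmark_score(results[m]), reverse=True)
print(leaderboard)  # ['model-a', 'model-b']
```

Macro-averaging weights every question equally regardless of how many rubric items it carries; an alternative would be to pool rubric items across questions, which would weight rubric-heavy questions more. The material above does not say which convention the paper uses.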
Circularity Check
No circularity: new benchmark and empirical evaluation are self-contained
Full rationale
The paper's derivation consists of introducing ProHist-Bench (400 questions, 10,891 rubrics from Keju exams via new interdisciplinary curation) and reporting LLM performance on it. No equations, fitted parameters, self-citations, or ansatzes are described that would reduce the central claim (LLMs struggle with higher-order historical reasoning) to the inputs by construction. The benchmark is explicitly positioned as filling a gap in prior evaluations rather than redefining or fitting to the target result. This is the normal case of an empirical benchmark paper whose load-bearing steps rest on fresh data collection.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The Chinese Imperial Examination (Keju) system constitutes a valid microcosm of East Asian political, social, and intellectual history suitable for testing higher-order historical reasoning.