pith. machine review for the scientific record.

arxiv: 2604.24690 · v1 · submitted 2026-04-27 · 💻 cs.CL

Recognition: unknown

Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:31 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language models · historical reasoning · benchmark · Chinese imperial examination · evidentiary reasoning · AI evaluation · domain expertise
0 comments

The pith

Even state-of-the-art LLMs struggle with complex historical research questions requiring evidentiary reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ProHist-Bench to test whether LLMs can perform at the level of trained historians rather than just recall facts. It anchors the benchmark in the Chinese Imperial Examination system, which tested officials on history, policy, and reasoning across more than 1,300 years and eight dynasties. The benchmark includes 400 challenging questions paired with 10,891 fine-grained rubrics created through expert collaboration. Evaluation of 18 LLMs shows they fall short on higher-order skills like building evidence-based arguments. This matters because existing tests only measure basic knowledge or text understanding, not the integrative thinking that defines professional historical work.
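The summary does not spell out a data schema, but it helps to picture what one benchmark item carries. The sketch below is a hypothetical representation; every field name is an illustrative assumption, not the released file format.

```python
from dataclasses import dataclass, field

# Hypothetical schema for a single ProHist-Bench item; field names are
# illustrative assumptions, not the released format.
@dataclass
class ProHistItem:
    question_id: str          # e.g. "tang-policy-017" (invented example)
    dynasty: str              # one of the eight covered dynasties
    task_category: str        # task type (T1-T4 per the figures)
    difficulty: str           # "General" or "Hard" per Figure 13
    question: str             # the exam-style prompt
    reference_answer: str     # expert-written model answer
    rubrics: list[str] = field(default_factory=list)  # fine-grained scoring criteria

# With 10,891 rubrics over 400 questions, each item carries roughly
# 27 rubric entries on average.
print(10891 / 400)  # ≈ 27.2
```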

Core claim

ProHist-Bench consists of 400 expert-curated questions spanning eight dynasties with 10,891 rubrics for scoring. When 18 LLMs were tested, even the best models performed poorly on tasks requiring deep historical analysis and evidentiary support, revealing a significant proficiency gap in professional-level historical reasoning.

What carries the argument

ProHist-Bench, a benchmark of 400 challenging questions and 10,891 fine-grained evaluation rubrics derived from the Chinese Imperial Examination (Keju) system.
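The scoring protocol behind those rubrics is not detailed in this summary. Rubric-based evaluation is typically implemented by checking a candidate answer against each rubric item and aggregating; the sketch below assumes an unweighted fraction-satisfied score and a hypothetical `judge` callable (e.g., an LLM-as-judge prompt), neither of which is confirmed by the paper.

```python
def score_answer(answer: str, rubrics: list[str], judge) -> float:
    """Score an answer as the fraction of rubric items the judge marks satisfied.

    `judge(answer, rubric)` is a hypothetical callable returning True/False;
    the paper may weight rubric items differently.
    """
    if not rubrics:
        return 0.0
    satisfied = sum(1 for rubric in rubrics if judge(answer, rubric))
    return satisfied / len(rubrics)

# Toy usage with a keyword-matching stand-in for the judge.
toy_judge = lambda ans, rubric: rubric.lower() in ans.lower()
print(score_answer("The Tang exam stressed poetry and policy essays.",
                   ["poetry", "policy", "examiner quotas"], toy_judge))  # 2/3 ≈ 0.67
```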

Load-bearing premise

The 400 questions and 10,891 rubrics accurately capture the higher-order evidentiary reasoning skills central to professional historical research.

What would settle it

An independent panel of historians answering the 400 questions and receiving rubric scores comparable to those of top LLMs would contradict the claimed proficiency gap.
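Making that test operational would mean scoring a historian panel and the strongest models on the same 400 questions with the same rubrics and comparing per-question results. The sketch below is one rough way to quantify the comparison; the function and the placeholder numbers are illustrative, not taken from the paper.

```python
import statistics

def proficiency_gap(historian_scores: list[float], model_scores: list[float]) -> float:
    """Mean per-question difference between historian-panel and best-model rubric scores.

    Both lists are aligned by question index; a gap near zero (or negative)
    would contradict the claimed proficiency gap.
    """
    diffs = [h - m for h, m in zip(historian_scores, model_scores)]
    return statistics.mean(diffs)

# Placeholder numbers, not results from the paper.
print(proficiency_gap([0.82, 0.74, 0.91], [0.55, 0.60, 0.48]))
```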

Figures

Figures reproduced from arXiv: 2604.24690 by Jiayi Deng, Jia Zhou, Junbo Zhao, Lirong Gao, Yanfei Zhang, Yanmei Gu, Yiming Zhang, Yuyan Cai, Zeqing Wang.

Figure 1. An illustration of LLMs' factual hallucination…
Figure 3. Distribution of task categories across various…
Figure 4. Consistency score of candidate judge models.
Figure 5. Main results of different models on the T4 task. The table presents the performance of representative…
Figure 6. Performance heatmap across nine historical capability dimensions (R1-R9).
Figure 7. Performance comparison between human ex…
Figure 8. Distribution of difficulty levels (General vs. Hard).
Figure 9. Cross-distribution of dynasties and task categories.
Figure 10. The hierarchical taxonomy of the topic framework for ProHist-Bench. This framework categorizes…
Figure 11. Prompt templates of Role Play and Professional Prompting.
Figure 12. Prompt templates of CoT and RAG.
Figure 13. Performance gap between General and Hard difficulty levels. The sharp decline in average scores (20.89…)
Figure 14. Heatmap of LLM performance across diverse historical periods. Most LLMs perform robustly in mainstream eras (e.g., Tang, Song, Ming) but struggle in low-resource contexts like the Liao and Jin dynasties.
Figure 15. Case study on Task T1 (Part I): Qualitative Results.
Figure 16. Case study on Task T1 (Part II): Qualitative Results.
Figure 17. Case study on Task T2 (Part I): Qualitative Results.
Figure 18. Case study on Task T2 (Part II) and T3 (Part I): Qualitative Results.
Figure 19. Case study on Task T3 (Part II) and T4 (Part I): Qualitative Results.
Figure 20. Case study on Task T4 (Part II): Qualitative Results.
Figure 21. Sample questions, reference answers, and rubrics from four tasks.
Original abstract

While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills, such as evidentiary reasoning, that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant proficiency gap: even state-of-the-art LLMs struggle with complex historical research questions. We hope ProHist-Bench will facilitate the development of domain-specific reasoning LLMs, advance computational historical research, and further uncover the untapped potential of LLMs. We release ProHist-Bench at https://github.com/inclusionAI/ABench/tree/main/ProHist-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces ProHist-Bench, a benchmark of 400 expert-curated questions drawn from the Chinese Imperial Examination (Keju) system across eight dynasties, paired with 10,891 fine-grained rubrics developed via interdisciplinary collaboration. It evaluates 18 LLMs on this benchmark and concludes that even state-of-the-art models exhibit a significant proficiency gap on complex historical research questions involving higher-order skills such as evidentiary reasoning. The benchmark is released publicly to support further work in domain-specific reasoning and computational history.

Significance. If the benchmark is shown to be a valid proxy for professional historical research capabilities, the work would usefully document current LLM limitations on structured historical reasoning tasks and supply a large-scale, rubric-annotated resource for model development. The public release of the dataset and the scale of the rubric set (10,891 items) constitute concrete strengths that could enable reproducible follow-up studies.

major comments (1)
  1. The central claim that LLMs 'struggle with complex historical research questions' (abstract) depends on ProHist-Bench faithfully measuring higher-order evidentiary reasoning rather than exam-format recall and structured argumentation. The manuscript provides no correlation study, inter-rater validation against practicing historians, or comparison to open-ended research tasks (e.g., primary-source triangulation), leaving the proxy relationship unverified and the broader conclusion at risk.
minor comments (1)
  1. The abstract and introduction would benefit from an explicit statement of how the 400 questions were sampled across dynasties and question types; a summary table would improve transparency.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address the central concern regarding the validity of ProHist-Bench as a proxy for higher-order historical reasoning below, providing the strongest honest defense grounded in the manuscript's design and limitations.

Point-by-point responses
  1. Referee: The central claim that LLMs 'struggle with complex historical research questions' (abstract) depends on ProHist-Bench faithfully measuring higher-order evidentiary reasoning rather than exam-format recall and structured argumentation. The manuscript provides no correlation study, inter-rater validation against practicing historians, or comparison to open-ended research tasks (e.g., primary-source triangulation), leaving the proxy relationship unverified and the broader conclusion at risk.

    Authors: We appreciate the referee's emphasis on rigorous validation of the proxy relationship. ProHist-Bench is explicitly anchored in the Keju system, which for over a millennium served as the primary mechanism for assessing precisely the higher-order skills of evidentiary reasoning, source synthesis, and structured argumentation rather than rote recall; the 400 questions were selected and phrased by historians to require these competencies, and the 10,891 rubrics were co-developed through interdisciplinary collaboration to score granular elements such as evidence citation, logical coherence, and contextual integration. This historical grounding and expert curation provide substantive support for the benchmark's alignment with professional historical research capabilities. That said, we acknowledge that the manuscript does not include a post-hoc correlation analysis with practicing historians' judgments on open-ended tasks or direct comparisons to primary-source triangulation exercises. We will revise the manuscript to expand the discussion of benchmark construction, explicitly articulate the rationale for using Keju as a proxy, and add a dedicated limitations subsection that notes the absence of such external validation studies while outlining directions for future work to address this gap.

    revision: partial
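The external validation both the referee and the rebuttal identify as missing could be approximated by correlating rubric-based scores with independent historian ratings of the same model answers. A minimal sketch of that check using Spearman rank correlation follows; the function name and placeholder inputs are illustrative assumptions, and SciPy is assumed to be available.

```python
from scipy.stats import spearmanr  # assumes SciPy is installed

def rubric_validity(rubric_scores: list[float], historian_ratings: list[float]):
    """Spearman correlation between rubric-based scores and independent historian
    ratings of the same answers; a high, significant rho would support the rubrics
    as a proxy for expert judgment."""
    rho, p_value = spearmanr(rubric_scores, historian_ratings)
    return rho, p_value

# Placeholder values for illustration only.
print(rubric_validity([0.42, 0.55, 0.31, 0.70], [3.0, 4.0, 2.5, 4.5]))
```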

Circularity Check

0 steps flagged

No circularity: new benchmark and empirical evaluation are self-contained

full rationale

The paper's derivation consists of introducing ProHist-Bench (400 questions, 10,891 rubrics from Keju exams via new interdisciplinary curation) and reporting LLM performance on it. No equations, fitted parameters, self-citations, or ansatzes are described that would reduce the central claim (LLMs struggle with higher-order historical reasoning) to the inputs by construction. The benchmark is explicitly positioned as filling a gap in prior evaluations rather than redefining or fitting to the target result. This is the normal case of an empirical benchmark paper whose load-bearing steps rest on fresh data collection.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that expert-curated questions from the Keju system validly measure professional historical reasoning; no free parameters or invented entities are described.

axioms (1)
  • domain assumption: The Chinese Imperial Examination (Keju) system constitutes a valid microcosm of East Asian political, social, and intellectual history suitable for testing higher-order historical reasoning.
    Used to anchor the benchmark design and question selection across eight dynasties.

pith-pipeline@v0.9.0 · 5539 in / 1134 out tokens · 49274 ms · 2026-05-08T03:31:27.295337+00:00 · methodology

discussion (0)


    Avoid conjecture or fabrication; ensure all content is verifiable through historical sources. The goal is to ensure your responses possess value for academic discourse, rather than serving merely as simple informational introductions. Please answer the following question: User: {Question} Professional Prompting: System:你是一位熟练使用5W1H历史分析法的中国历史研究者。你的任务是基于用户的...