K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs
Pith reviewed 2026-05-12 03:42 UTC · model grok-4.3
The pith
A knowledge graph extracted from K-12 textbooks creates both a diagnostic benchmark for curriculum cognition and a compact training set that improves educational LLM performance more efficiently than general instruction data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a curriculum-aligned knowledge graph with seven node types and nine relation types, extracted from primary-to-high-school textbooks, can generate two resources: a 23,640-question benchmark that exposes deficiencies in prerequisite and association reasoning, and a 2,300-pair supervised fine-tuning corpus that, under matched data budgets, produces stronger gains on standard educational evaluation sets than equally sized subsets of mainstream instruction corpora.
What carries the argument
K12-KGraph, the directed graph whose nodes represent concepts, skills, experiments, exercises, sections, chapters, and books, and whose edges capture taxonomy, prerequisite, association, verification, assessment, location, and order relations derived directly from textbook content.
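A minimal sketch of this schema, assuming a networkx MultiDiGraph representation: the node and relation categories come from the abstract, but the relation identifiers beyond is_a, prerequisites_for, and relates_to (which appear in the paper's task definitions) and the helper functions are hypothetical, not the authors' released code.

```python
# Illustrative schema sketch, not the authors' released code.
# Node types follow the abstract; relation names marked "assumed"
# are placeholders for the paper's verification/assessment/location/order edges.
import networkx as nx

NODE_TYPES = {"Concept", "Skill", "Experiment", "Exercise",
              "Section", "Chapter", "Book"}
RELATION_TYPES = {
    "is_a",                # taxonomy
    "prerequisites_for",   # prerequisite
    "relates_to",          # association
    "verified_by",         # verification (assumed name)
    "assessed_by",         # assessment (assumed name)
    "located_in",          # location (assumed name)
    "precedes",            # order (assumed name)
}

g = nx.MultiDiGraph()

def add_typed_node(g, node_id, node_type, **attrs):
    assert node_type in NODE_TYPES, f"unknown node type: {node_type}"
    g.add_node(node_id, type=node_type, **attrs)

def add_typed_edge(g, src, dst, relation):
    assert relation in RELATION_TYPES, f"unknown relation: {relation}"
    g.add_edge(src, dst, relation=relation)

add_typed_node(g, "fraction", "Concept", name="Fraction")
add_typed_node(g, "fraction_addition", "Skill", name="Adding fractions")
add_typed_edge(g, "fraction", "fraction_addition", "prerequisites_for")
```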
If this is right
- Fine-tuning with graph-structured data of only 2,300 examples improves results on GaokaoBench and EduEval beyond what equal volumes of general instruction data achieve.
- Existing LLMs show clear shortfalls on tasks that require identifying prerequisites, neighbors, evidence, and locations within the curriculum graph.
- The graph enables construction of multiple task families that separately test grounding, prerequisite awareness, neighborhood relations, evidence retrieval, and section location (a sketch follows this list).
- Releasing the graph, benchmark, training set, and extraction pipeline supports direct reproduction and further scaling to additional subjects.
- Curriculum cognition measured by the benchmark is distinct from the factual recall already tested by C-Eval, CMMLU, and similar exams.
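To make the task families concrete, here is a hypothetical sketch of how a Prereq-family multi-select item could be derived from the graph above and scored by exact match. The sampling and distractor policy shown are assumptions for illustration, not the authors' pipeline.

```python
# Hypothetical Prereq item construction, reusing the graph sketched
# earlier; random non-gold distractors are an assumption, not the
# authors' candidate pooling and filtering.
import random

def prereq_question(g, target, n_distractors=3, rng=random):
    # Gold answers: all sources of prerequisites_for edges into `target`.
    gold = {u for u, v, d in g.in_edges(target, data=True)
            if d["relation"] == "prerequisites_for"}
    pool = [n for n in g.nodes if n != target and n not in gold]
    distractors = rng.sample(pool, min(n_distractors, len(pool)))
    options = sorted(gold) + distractors
    rng.shuffle(options)
    return {"question": f"Which of the following are prerequisites of {target}?",
            "options": options, "gold": gold}

def exact_match(predicted: set, gold: set) -> int:
    # Multi-select exact match: credit only for the exact answer set.
    return int(predicted == gold)
```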
Where Pith is reading between the lines
- The same graph could be used to generate ordered learning sequences that respect prerequisite dependencies for individual students.
- Embedding the graph structure inside model training or inference might allow explicit traversal of knowledge relations instead of relying solely on pattern matching.
- Rebuilding analogous graphs from textbooks in other languages or national curricula would test whether the efficiency gains transfer beyond the Chinese K-12 context.
- Automated synthesis from graphs may offer a scalable way to create high-quality educational data with reduced need for manual curation.
Load-bearing premise
The automated pipeline that extracts the graph from textbooks and synthesizes QA pairs from its nodes and edges produces examples that faithfully reflect curriculum structure without meaningful artifacts or biases.
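For intuition, a minimal sketch of what KG-guided QA synthesis could look like, assuming one fixed template per relation type; the paper synthesizes pairs from graph structure and node attributes via relation-specific prompting, so the template strings below are illustrative stand-ins, not the authors' prompts.

```python
# Minimal template-based synthesis sketch; the paper uses
# relation-specific LLM prompting, so these strings are stand-ins.
TEMPLATES = {
    "prerequisites_for": (
        "Why should one learn {src} before {dst}?",
        "Because learning {dst} requires knowledge or ability that {src} "
        "provides, {src} should be learned first."),
    "is_a": (
        "Why does {src} belong to {dst}?",
        "Because {src} satisfies the defining properties of {dst}, "
        "{src} belongs to {dst}."),
}

def synthesize_qa(g):
    # Yield one QA pair per edge whose relation has a template.
    for u, v, d in g.edges(data=True):
        tpl = TEMPLATES.get(d["relation"])
        if tpl is None:
            continue
        src = g.nodes[u].get("name", u)
        dst = g.nodes[v].get("name", v)
        yield {"question": tpl[0].format(src=src, dst=dst),
               "answer": tpl[1].format(src=src, dst=dst)}
```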
What would settle it
Manual review of a large random sample of the synthesized QA pairs followed by retraining the same base models on the cleaned data and checking whether the reported performance advantage on GaokaoBench and EduEval disappears.
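One way such an audit could be implemented, assuming each synthesized QA pair is a dict with a task_family field (a hypothetical layout): draw a stratified random sample, review it manually, and retrain on the items that pass.

```python
# Assumed audit protocol: stratify by task family, sample a fixed
# fraction for manual review, then retrain on the pairs that pass.
import random
from collections import defaultdict

def stratified_sample(qa_pairs, key="task_family", frac=0.1, seed=0):
    rng = random.Random(seed)
    strata = defaultdict(list)
    for qa in qa_pairs:
        strata[qa.get(key, "unknown")].append(qa)
    sample = []
    for items in strata.values():
        k = max(1, int(len(items) * frac))
        sample.extend(rng.sample(items, k))
    return sample  # hand to reviewers; keep only pairs marked clean
```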
Original abstract
Large language models (LLMs) are increasingly used in K-12 education, yet existing benchmarks such as C-Eval, CMMLU, GaokaoBench, and EduEval mainly evaluate factual recall through exam-style question answering. Effective educational AI additionally requires curriculum cognition: understanding how knowledge is structured through prerequisite chains, concept taxonomies, experiment-concept links, and pedagogical sequencing. To address this gap, we introduce K12-KGraph, a curriculum-aligned knowledge graph extracted from official People's Education Press textbooks across mathematics, physics, chemistry, and biology from primary to high school. The graph contains seven node types (Concept, Skill, Experiment, Exercise, Section, Chapter, Book) and nine relation types covering taxonomy, prerequisite, association, verification, assessment, location, and order. Based on this graph, we construct two resources: (1) K12-Bench, a 23,640-question multi-select benchmark spanning five graph-derived task families (Ground, Prereq, Neighbor, Evidence, and Locate); and (2) K12-Train, a KG-guided supervised fine-tuning corpus of approximately 2,300 QA pairs synthesized from graph structure and node attributes. Experiments reveal substantial deficiencies in curriculum cognition: on K12-Bench, Gemini-3-Flash achieves only 57% exact match, while the best open-source model, Gemma-4-31B-IT, reaches 46%. Under a strictly matched 2,300-sample SFT budget on Qwen3-4B-Base and Llama-3.1-8B-Base, K12-Train consistently outperforms equally sized subsets from eight mainstream instruction-tuning corpora on both GaokaoBench and EduEval, demonstrating that curriculum-structured supervision is highly sample-efficient for educational tuning. We release the graph, benchmark, training data, and full construction pipeline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces K12-KGraph, a curriculum-aligned knowledge graph extracted from official People's Education Press textbooks across mathematics, physics, chemistry, and biology from primary to high school. The graph has seven node types (Concept, Skill, Experiment, Exercise, Section, Chapter, Book) and nine relation types (taxonomy, prerequisite, association, verification, assessment, location, order). From it, the authors derive K12-Bench (23,640 multi-select questions across five graph-derived task families: Ground, Prereq, Neighbor, Evidence, Locate) and K12-Train (~2,300 QA pairs synthesized from graph structure and node attributes). Experiments show current LLMs struggle on K12-Bench (Gemini-3-Flash at 57% exact match, Gemma-4-31B-IT at 46%) and that SFT on K12-Train under a matched 2,300-sample budget outperforms equally sized subsets from eight mainstream instruction-tuning corpora on GaokaoBench and EduEval for Qwen3-4B-Base and Llama-3.1-8B-Base.
Significance. If the automated extraction and synthesis steps are shown to be accurate and free of artifacts, this work provides valuable resources for benchmarking and improving curriculum cognition (prerequisite chains, taxonomies, experiment links) in educational LLMs beyond factual recall benchmarks. The matched-budget SFT results, if isolated to the graph-derived structure, would demonstrate high sample efficiency of curriculum-aligned supervision. The explicit release of the graph, benchmark, training data, and full construction pipeline is a clear strength that supports reproducibility and follow-on work.
Major comments (1)
- [Abstract and K12-Train construction section] The central claim, that K12-Train's outperformance under a matched 2,300-sample SFT budget is caused by curriculum structure (prerequisite chains, taxonomies, experiment-concept links), requires that the automated QA synthesis produces high-quality, unbiased examples. No details are provided on graph extraction accuracy from textbooks, inter-annotator agreement for validation, or an ablation replacing graph-guided synthesis with random textbook excerpts of equal size and domain; this is load-bearing because pipeline artifacts or indirect overlap with GaokaoBench/EduEval distributions could explain the gains instead.
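A sketch of the control-corpus construction for the missing ablation, with the matching-by-subject design assumed (field names such as subject are hypothetical, not from the paper):

```python
# Assumed control-corpus construction for the matched-budget ablation:
# random textbook excerpts matched to K12-Train in size and subject mix.
import random
from collections import Counter

def matched_control(k12_train, textbook_excerpts, seed=0):
    rng = random.Random(seed)
    budget = Counter(ex["subject"] for ex in k12_train)  # ~2,300 total
    by_subject = {}
    for ex in textbook_excerpts:
        by_subject.setdefault(ex["subject"], []).append(ex)
    control = []
    for subject, n in budget.items():
        pool = by_subject.get(subject, [])
        control.extend(rng.sample(pool, min(n, len(pool))))
    return control
# Fine-tune identically on K12-Train and on `control`; any residual
# gap on GaokaoBench/EduEval then points to graph-derived structure.
```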
Minor comments (2)
- [Abstract] The abstract states 'approximately 2,300' QA pairs; reporting the exact count and breakdown by subject or task family would improve reproducibility and allow better assessment of the matched-budget experiments.
- [K12-Bench description] The five task families in K12-Bench are described as 'graph-derived' but the precise mapping from node/relation types to each family (e.g., how 'Locate' uses location relations) is not fully specified; adding a table or diagram would clarify benchmark construction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger validation of our pipeline. We address the major comment below and will revise the manuscript to incorporate the requested details and analysis.
Point-by-point responses
- Referee: [Abstract and K12-Train construction section] The central claim, that K12-Train's outperformance under a matched 2,300-sample SFT budget is caused by curriculum structure (prerequisite chains, taxonomies, experiment-concept links), requires that the automated QA synthesis produces high-quality, unbiased examples. No details are provided on graph extraction accuracy from textbooks, inter-annotator agreement for validation, or an ablation replacing graph-guided synthesis with random textbook excerpts of equal size and domain; this is load-bearing because pipeline artifacts or indirect overlap with GaokaoBench/EduEval distributions could explain the gains instead.
Authors: We agree that the central claim would be strengthened by explicit validation of the QA synthesis quality. In the revised manuscript we will expand the K12-Train construction section to include: (1) a description of the rule-based graph extraction process from the structured official textbooks, together with accuracy figures from manual verification on a held-out sample of nodes and relations; (2) clarification that validation was performed via author consensus checks rather than multi-annotator agreement metrics; and (3) results of an ablation that replaces the graph-guided synthesis with randomly sampled textbook excerpts of identical size and subject distribution. These additions will directly test whether the observed gains on GaokaoBench and EduEval can be attributed to curriculum structure rather than artifacts or distributional overlap.
Revision: yes
Circularity Check
No circularity: empirical comparison on external benchmarks
Full rationale
The paper's load-bearing claim is an empirical result under a controlled 2,300-sample SFT budget: K12-Train outperforms matched subsets from eight other corpora on GaokaoBench and EduEval. This does not reduce to any internal equation, fitted parameter renamed as prediction, or self-citation chain. K12-Bench is constructed from the graph, but the reported efficiency result uses external benchmarks. No uniqueness theorem, ansatz, or renaming of known results is invoked to derive the central finding. The graph extraction and QA synthesis are construction steps whose quality is an assumption, not a circular derivation of the performance numbers.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Official People's Education Press textbooks accurately encode the intended prerequisite chains, taxonomies, and pedagogical sequencing for K-12 subjects.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (relevance unclear). Matched text: "K12-KGraph ... seven node types (Concept, Skill, Experiment, Exercise, Section, Chapter, Book) and nine relation types covering taxonomy, prerequisite, association, verification, assessment, location, and order ... K12-Train, a KG-guided supervised fine-tuning corpus of approximately 2,300 QA pairs synthesized from graph structure and node attributes."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem absolute_floor_iff_bare_distinguishability (relevance unclear). Matched text: "Prereq: Prerequisite Reasoning (prerequisites_for). ... Neighbor: Neighbor Recommendation (is_a, relates_to)."
Reference graph
Works this paper leans on
- [1] Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, et al. SmolLM2: When Smol goes big -- data-centric training of a small language model. arXiv preprint arXiv:2502.02737, 2025.
- [2] Hamed Babaei Giglou, Jennifer D'Souza, and Sören Auer. LLMs4OL: Large language models for ontology learning. In International Semantic Web Conference, pages 408–427. Springer, 2023.
- [3] Penghe Chen, Yu Lu, Vincent W Zheng, and Yang Pian. Prerequisite-driven deep knowledge tracing. In 2018 IEEE International Conference on Data Mining (ICDM), pages 39–48. IEEE, 2018.
- [4] Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3029–3051, 2023.
- [5] Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools. arXiv preprint arXiv:2406.12793, 2024.
- [7] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [8] Jinchang Hou, Chang Ao, Haihong Wu, Xiangtao Kong, Zhigang Zheng, Daijia Tang, Chengming Li, Xiping Hu, Ruifeng Xu, Shiwen Ni, et al. E-Eval: A comprehensive Chinese K-12 education evaluation benchmark for large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 7753–7774, 2024.
- [9] Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Yao Fu, et al. C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. Advances in Neural Information Processing Systems, 36:62991–63010, 2023.
- [10] Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103:102274, 2023.
- [11] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.
- [12] Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. In Findings of the Association for Computational Linguistics: ACL 2024, pages 11260–11285, 2024.
- [13] Jijie Li, Li Du, Hanyu Zhao, Bo-wen Zhang, Liangdong Wang, Boyan Gao, Guang Liu, and Yonghua Lin. Infinity Instruct: Scaling instruction selection and synthesis to enhance language models. arXiv preprint arXiv:2506.11116, 2025.
- [14] Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, et al. DataFlow: An LLM-driven framework for unified data preparation and workflow automation in the era of data-centric AI. arXiv preprint arXiv:2512.16676, 2025.
- [15] Alexander H Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, et al. Ministral 3. arXiv preprint arXiv:2601.08584, 2026.
- [16] Qi Liu, Zhenya Huang, Yu Yin, Enhong Chen, Hui Xiong, Yu Su, and Guoping Hu. EKT: Exercise-aware knowledge tracing for student performance prediction. IEEE Transactions on Knowledge and Data Engineering, 33(1):100–115, 2019.
- [17] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. WizardCoder: Empowering code large language models with Evol-Instruct. arXiv preprint arXiv:2306.08568, 2023.
- [18] Guoqing Ma, Jia Zhu, Hanghui Guo, Weijie Shi, Yue Cui, Jiawei Shen, Zilong Li, and Yidan Liang. EduEval: A hierarchical cognitive benchmark for evaluating large language models in Chinese education. arXiv preprint arXiv:2512.00290, 2025.
- [19] Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, et al. MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing. In The 64th Annual Meeting of the Association for Computational Linguistics -- Industry Track, 2025.
- [20] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- [21] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
- [22] Teknium. OpenHermes 2.5: An open dataset of synthetic data for generalist LLM assistants, 2023. URL https://huggingface.co/datasets/teknium/OpenHermes-2.5.
- [23] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, 2023.
- [24] Xiang Wei, Xingyu Cui, Ning Cheng, Xiaobin Wang, Xin Zhang, Shen Huang, Pengjun Xie, Jinan Xu, Yufeng Chen, Meishan Zhang, et al. ChatIE: Zero-shot information extraction via chatting with ChatGPT. arXiv preprint arXiv:2302.10205, 2023.
- [25] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [26] Yuqing Ye, Xuan Zhou, Zhifu Chen, Dandan Li, Hengnian Gu, Jin Peng Zhou, and Dongdai Zhou. K-12EduBench: A benchmark for evaluating large language models' knowledge, problem-solving, and educational goal cognition in K-12 education. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 34459–34466, 2026.
- [27] Weihao You, Pengcheng Wang, Changlong Li, Zhilong Ji, and Jinfeng Bai. CK12: A rounded K12 knowledge graph based benchmark for Chinese holistic cognition evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19431–19439, 2024.
- [28] Peitian Zhang, Shitao Xiao, Zheng Liu, Zhicheng Dou, and Jian-Yun Nie. Retrieve anything to augment large language models. arXiv preprint arXiv:2310.07554, 2023.
- [29] Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. Evaluating the performance of large language models on Gaokao benchmark. arXiv preprint arXiv:2305.12474, 2023.
- [30] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P Xing, et al. LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset. arXiv preprint arXiv:2309.11998, 2023.
- [31] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. LlamaFactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 400–410, 2024.