pith. machine review for the scientific record.

arxiv: 2605.09635 · v1 · submitted 2026-05-10 · 💻 cs.CL

Recognition: 2 theorem links

K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs

Hao Liang, Linzhuang Sun, Meiyi Qiang, Qihan Lin, Wentao Zhang, Xiaochen Ma, Zhaoyang Han, Zhen Hao Wong

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:42 UTC · model grok-4.3

classification 💻 cs.CL
keywords knowledge graph · K-12 education · curriculum cognition · educational LLM · supervised fine-tuning · prerequisite chains · textbook extraction · GaokaoBench

The pith

A knowledge graph extracted from K-12 textbooks creates both a diagnostic benchmark for curriculum cognition and a compact training set that improves educational LLM performance more efficiently than general instruction data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds K12-KGraph from official Chinese textbooks in math, physics, chemistry, and biology to encode how knowledge is organized through prerequisites, taxonomies, experiments, and sequencing. It turns the graph into K12-Bench, a large multi-select test that measures whether models understand these structures rather than isolated facts, and finds current models perform poorly. From the same graph it synthesizes K12-Train, roughly 2,300 question-answer pairs, and shows that fine-tuning base models on this small set yields higher scores on GaokaoBench and EduEval than equal-sized samples drawn from eight common instruction-tuning collections. The result indicates that supervision organized around curriculum relationships can be more sample-efficient for education-focused capabilities than broader but unstructured data.

Core claim

The central claim is that a curriculum-aligned knowledge graph with seven node types and nine relation types, extracted from primary-to-high-school textbooks, can generate both a 23,640-question benchmark exposing deficiencies in prerequisite and association reasoning and a 2,300-pair supervised fine-tuning corpus that, under matched data budgets, produces stronger gains on standard educational evaluation sets than subsets of mainstream instruction corpora.
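The matched-budget protocol this claim rests on can be sketched in a few lines, assuming nothing beyond what the claim states: every comparison corpus is subsampled to the same 2,300-example budget before fine-tuning, so data quantity is held fixed. Corpus names and contents below are placeholders, not the paper's released splits.

```python
import random

BUDGET = 2300  # the matched SFT budget reported in the paper

def matched_subsets(corpora, budget=BUDGET, seed=42):
    """Draw an equal-sized random subset from each corpus (placeholder data)."""
    rng = random.Random(seed)
    return {
        name: rng.sample(data, budget) if len(data) > budget else list(data)
        for name, data in corpora.items()
    }

corpora = {
    "k12_train": [f"k12-{i}" for i in range(2300)],       # graph-synthesized pairs
    "general_sft": [f"gen-{i}" for i in range(100_000)],  # a large generic corpus
}
subsets = matched_subsets(corpora)
# Every SFT run now sees exactly 2,300 examples; only data *structure* varies.
```

Under this control, any score gap on GaokaoBench or EduEval is attributable to what the examples encode rather than to how many there are.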

What carries the argument

K12-KGraph, the directed graph whose nodes represent concepts, skills, experiments, exercises, sections, chapters, and books, and whose edges capture taxonomy, prerequisite, association, verification, assessment, location, and order relations derived directly from textbook content.
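As a rough illustration of the structure this describes, here is a minimal typed-graph sketch. The seven node types come from the abstract; the edge label `prerequisite_for`, the example concepts, and the traversal helper are illustrative assumptions, not the paper's released schema or identifiers.

```python
from collections import defaultdict
from dataclasses import dataclass

# The seven node types reported in the abstract.
NODE_TYPES = ("Concept", "Skill", "Experiment", "Exercise", "Section", "Chapter", "Book")

@dataclass
class Node:
    name: str
    ntype: str  # must be one of NODE_TYPES

class CurriculumGraph:
    """Directed graph with typed edges; edge labels here are illustrative."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(list)  # src name -> [(relation, dst name)]

    def add_node(self, name, ntype):
        assert ntype in NODE_TYPES
        self.nodes[name] = Node(name, ntype)

    def add_edge(self, src, relation, dst):
        self.edges[src].append((relation, dst))

    def prerequisites_of(self, target):
        """All transitive prerequisites of `target` via 'prerequisite_for' edges."""
        rev = defaultdict(list)  # reverse index: dst -> [src]
        for src, outs in self.edges.items():
            for rel, dst in outs:
                if rel == "prerequisite_for":
                    rev[dst].append(src)
        seen, stack = set(), [target]
        while stack:
            for p in rev[stack.pop()]:
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen

g = CurriculumGraph()
for name in ("fractions", "ratios", "proportional reasoning"):
    g.add_node(name, "Concept")
g.add_edge("fractions", "prerequisite_for", "ratios")
g.add_edge("ratios", "prerequisite_for", "proportional reasoning")
print(sorted(g.prerequisites_of("proportional reasoning")))  # -> ['fractions', 'ratios']
```

A benchmark item or training pair would then be synthesized by walking edges like these, which is what lets one graph serve as both a test and a supervision source.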

If this is right

  • Fine-tuning with graph-structured data of only 2,300 examples improves results on GaokaoBench and EduEval beyond what equal volumes of general instruction data achieve.
  • Existing LLMs show clear shortfalls on tasks that require identifying prerequisites, neighbors, evidence, and locations within the curriculum graph.
  • The graph enables construction of multiple task families that separately test grounding, prerequisite awareness, neighborhood relations, evidence retrieval, and section location.
  • Releasing the graph, benchmark, training set, and extraction pipeline supports direct reproduction and further scaling to additional subjects.
  • Curriculum cognition measured by the benchmark is distinct from the factual recall already tested by C-Eval, CMMLU, and similar exams.
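To make the task-family idea concrete, here is a toy sketch of how a Prereq-style multi-select item might be derived from prerequisite edges, with rule-based distractor filtering in the spirit the paper describes. The question template, field names, and option count are assumptions for illustration, not the released benchmark format.

```python
import random

def make_prereq_item(target, prereqs, concept_pool, n_options=4, seed=0):
    """Build one multi-select question from a node's prerequisite set."""
    rng = random.Random(seed)
    gold = set(prereqs)
    # Rule-based filtering: distractors may not be gold answers or the target.
    candidates = [c for c in concept_pool if c not in gold and c != target]
    distractors = rng.sample(candidates, n_options - len(gold))
    options = sorted(gold) + distractors
    rng.shuffle(options)
    return {
        "question": f"Which of the following are prerequisites for learning '{target}'?",
        "options": options,
        "answers": sorted(o for o in options if o in gold),  # multi-select gold set
    }

item = make_prereq_item(
    "ratios",
    prereqs=["fractions"],
    concept_pool=["fractions", "photosynthesis", "ionic bonds", "velocity", "ratios"],
)
```

Scoring such items by exact match on the gold set, as the abstract reports, penalizes every missed or spurious selection, which is part of what makes multi-select harder than single-answer recall.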

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph could be used to generate ordered learning sequences that respect prerequisite dependencies for individual students.
  • Embedding the graph structure inside model training or inference might allow explicit traversal of knowledge relations instead of relying solely on pattern matching.
  • Rebuilding analogous graphs from textbooks in other languages or national curricula would test whether the efficiency gains transfer beyond the Chinese K-12 context.
  • Automated synthesis from graphs may offer a scalable way to create high-quality educational data with reduced need for manual curation.

Load-bearing premise

The automated pipeline that extracts the graph from textbooks and synthesizes QA pairs from its nodes and edges produces examples that faithfully reflect curriculum structure without meaningful artifacts or biases.

What would settle it

Manual review of a large random sample of the synthesized QA pairs followed by retraining the same base models on the cleaned data and checking whether the reported performance advantage on GaokaoBench and EduEval disappears.

Original abstract

Large language models (LLMs) are increasingly used in K-12 education, yet existing benchmarks such as C-Eval, CMMLU, GaokaoBench, and EduEval mainly evaluate factual recall through exam-style question answering. Effective educational AI additionally requires curriculum cognition: understanding how knowledge is structured through prerequisite chains, concept taxonomies, experiment-concept links, and pedagogical sequencing. To address this gap, we introduce K12-KGraph, a curriculum-aligned knowledge graph extracted from official People's Education Press textbooks across mathematics, physics, chemistry, and biology from primary to high school. The graph contains seven node types (Concept, Skill, Experiment, Exercise, Section, Chapter, Book) and nine relation types covering taxonomy, prerequisite, association, verification, assessment, location, and order. Based on this graph, we construct two resources: (1) K12-Bench, a 23,640-question multi-select benchmark spanning five graph-derived task families (Ground, Prereq, Neighbor, Evidence, and Locate); and (2) K12-Train, a KG-guided supervised fine-tuning corpus of approximately 2,300 QA pairs synthesized from graph structure and node attributes. Experiments reveal substantial deficiencies in curriculum cognition: on K12-Bench, Gemini-3-Flash achieves only 57% exact match, while the best open-source model, Gemma-4-31B-IT, reaches 46%. Under a strictly matched 2,300-sample SFT budget on Qwen3-4B-Base and Llama-3.1-8B-Base, K12-Train consistently outperforms equally sized subsets from eight mainstream instruction-tuning corpora on both GaokaoBench and EduEval, demonstrating that curriculum-structured supervision is highly sample-efficient for educational tuning. We release the graph, benchmark, training data, and full construction pipeline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces K12-KGraph, a curriculum-aligned knowledge graph extracted from official People's Education Press textbooks across mathematics, physics, chemistry, and biology from primary to high school. The graph has seven node types (Concept, Skill, Experiment, Exercise, Section, Chapter, Book) and nine relation types (taxonomy, prerequisite, association, verification, assessment, location, order). From it, the authors derive K12-Bench (23,640 multi-select questions across five graph-derived task families: Ground, Prereq, Neighbor, Evidence, Locate) and K12-Train (~2,300 QA pairs synthesized from graph structure and node attributes). Experiments show current LLMs struggle on K12-Bench (Gemini-3-Flash at 57% exact match, Gemma-4-31B-IT at 46%) and that SFT on K12-Train under a matched 2,300-sample budget outperforms equally sized subsets from eight mainstream instruction-tuning corpora on GaokaoBench and EduEval for Qwen3-4B-Base and Llama-3.1-8B-Base.

Significance. If the automated extraction and synthesis steps are shown to be accurate and free of artifacts, this work provides valuable resources for benchmarking and improving curriculum cognition (prerequisite chains, taxonomies, experiment links) in educational LLMs beyond factual recall benchmarks. The matched-budget SFT results, if isolated to the graph-derived structure, would demonstrate high sample efficiency of curriculum-aligned supervision. The explicit release of the graph, benchmark, training data, and full construction pipeline is a clear strength that supports reproducibility and follow-on work.

major comments (1)
  1. [Abstract and K12-Train construction section] The central claim, that K12-Train's outperformance under a matched 2,300-sample SFT budget is caused by curriculum structure (prerequisite chains, taxonomies, experiment-concept links), requires that the automated QA synthesis produces high-quality, unbiased examples. No details are provided on graph extraction accuracy from textbooks, inter-annotator agreement for validation, or an ablation replacing graph-guided synthesis with random textbook excerpts of equal size and domain. This gap is load-bearing because pipeline artifacts or indirect overlap with the GaokaoBench/EduEval distributions could explain the gains instead.
minor comments (2)
  1. [Abstract] The abstract states 'approximately 2,300' QA pairs; reporting the exact count and breakdown by subject or task family would improve reproducibility and allow better assessment of the matched-budget experiments.
  2. [K12-Bench description] The five task families in K12-Bench are described as 'graph-derived' but the precise mapping from node/relation types to each family (e.g., how 'Locate' uses location relations) is not fully specified; adding a table or diagram would clarify benchmark construction.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger validation of our pipeline. We address the major comment below and will revise the manuscript to incorporate the requested details and analysis.

Point-by-point responses
  1. Referee: [Abstract and K12-Train construction section] The central claim, that K12-Train's outperformance under a matched 2,300-sample SFT budget is caused by curriculum structure (prerequisite chains, taxonomies, experiment-concept links), requires that the automated QA synthesis produces high-quality, unbiased examples. No details are provided on graph extraction accuracy from textbooks, inter-annotator agreement for validation, or an ablation replacing graph-guided synthesis with random textbook excerpts of equal size and domain. This gap is load-bearing because pipeline artifacts or indirect overlap with the GaokaoBench/EduEval distributions could explain the gains instead.

    Authors: We agree that the central claim would be strengthened by explicit validation of the QA synthesis quality. In the revised manuscript we will expand the K12-Train construction section to include: (1) a description of the rule-based graph extraction process from the structured official textbooks, together with accuracy figures from manual verification on a held-out sample of nodes and relations; (2) clarification that validation was performed via author consensus checks rather than multi-annotator agreement metrics; and (3) results of an ablation that replaces the graph-guided synthesis with randomly sampled textbook excerpts of identical size and subject distribution. These additions will directly test whether the observed gains on GaokaoBench and EduEval can be attributed to curriculum structure rather than artifacts or distributional overlap.

    revision: yes
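The ablation control promised in the rebuttal can be sketched briefly: sample raw textbook excerpts to match K12-Train's size and per-subject distribution, so that any remaining performance gap is attributable to graph structure rather than domain or volume. The data shapes and subject labels below are hypothetical, since the paper does not report K12-Train's exact breakdown.

```python
import random
from collections import Counter, defaultdict

def stratified_control(excerpts, target_counts, seed=0):
    """Sample raw excerpts matching a target per-subject count distribution.

    excerpts: list of (subject, text) pairs; target_counts: Counter subject -> n.
    """
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for subject, text in excerpts:
        by_subject[subject].append(text)
    control = []
    for subject, n in target_counts.items():
        control.extend(rng.sample(by_subject[subject], n))
    return control

# Hypothetical pool and target distribution, for illustration only.
target = Counter({"math": 2, "physics": 1})
pool = [("math", f"m{i}") for i in range(10)] + [("physics", f"p{i}") for i in range(10)]
control = stratified_control(pool, target)
assert len(control) == sum(target.values())
```

Fine-tuning the same base models on this control corpus and comparing against K12-Train would isolate the contribution of graph-guided synthesis.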

Circularity Check

0 steps flagged

No circularity: empirical comparison on external benchmarks

Full rationale

The paper's load-bearing claim is an empirical result under controlled 2,300-sample SFT budgets: K12-Train outperforms matched subsets from eight other corpora on GaokaoBench and EduEval. This does not reduce to any internal equation, fitted parameter renamed as prediction, or self-citation chain. K12-Bench is constructed from the graph, but the reported efficiency result uses external benchmarks. No uniqueness theorem, ansatz, or renaming of known results is invoked to derive the central finding. The graph extraction and QA synthesis are construction steps whose quality is an assumption, not a circular derivation of the performance numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that textbook-derived structure faithfully represents curriculum cognition and that synthetic QA pairs preserve that structure without distortion.

axioms (1)
  • domain assumption: Official People's Education Press textbooks accurately encode the intended prerequisite chains, taxonomies, and pedagogical sequencing for K-12 subjects.
    The entire graph and all derived resources are extracted from these textbooks; any mismatch would invalidate the benchmark and training claims.

pith-pipeline@v0.9.0 · 5665 in / 1254 out tokens · 45987 ms · 2026-05-12T03:42:04.924669+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 7 internal anchors

  1. [1]

    SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, et al. SmolLM2: When smol goes big -- data-centric training of a small language model. arXiv preprint arXiv:2502.02737, 2025

  2. [2]

    Llms4ol: Large language models for ontology learning

    Hamed Babaei Giglou, Jennifer D’Souza, and Sören Auer. Llms4ol: Large language models for ontology learning. In International semantic web conference, pages 408–427. Springer, 2023

  3. [3]

    Prerequisite-driven deep knowledge tracing

    Penghe Chen, Yu Lu, Vincent W Zheng, and Yang Pian. Prerequisite-driven deep knowledge tracing. In 2018 IEEE international conference on data mining (ICDM), pages 39–48. IEEE, 2018

  4. [4]

    Enhancing chat language models by scaling high-quality instructional conversations

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3029–3051, 2023

  5. [5]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools. arXiv preprint arXiv:2406.12793, 2024

  6. [7]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  7. [8]

    E-eval: A comprehensive chinese k-12 education evaluation benchmark for large language models

    Jinchang Hou, Chang Ao, Haihong Wu, Xiangtao Kong, Zhigang Zheng, Daijia Tang, Chengming Li, Xiping Hu, Ruifeng Xu, Shiwen Ni, et al. E-eval: A comprehensive chinese k-12 education evaluation benchmark for large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 7753–7774, 2024

  8. [9]

    C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

    Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Yao Fu, et al. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Advances in neural information processing systems, 36:62991–63010, 2023

  9. [10]

    Chatgpt for good? On opportunities and challenges of large language models for education

    Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. Chatgpt for good? On opportunities and challenges of large language models for education. Learning and individual differences, 103:102274, 2023

  10. [11]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024

  11. [12]

    Cmmlu: Measuring massive multitask language understanding in chinese

    Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. Cmmlu: Measuring massive multitask language understanding in chinese. In Findings of the Association for Computational Linguistics: ACL 2024, pages 11260–11285, 2024

  12. [13]

    Infinity instruct: Scaling instruction selection and synthesis to enhance language models

    Jijie Li, Li Du, Hanyu Zhao, Bo-wen Zhang, Liangdong Wang, Boyan Gao, Guang Liu, and Yonghua Lin. Infinity instruct: Scaling instruction selection and synthesis to enhance language models. arXiv preprint arXiv:2506.11116, 2025

  13. [14]

    Dataflow: An llm-driven framework for unified data preparation and workflow automation in the era of data-centric ai

    Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, et al. Dataflow: An llm-driven framework for unified data preparation and workflow automation in the era of data-centric ai. arXiv preprint arXiv:2512.16676, 2025

  14. [15]

    Ministral 3

    Alexander H Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, et al. Ministral 3. arXiv preprint arXiv:2601.08584, 2026

  15. [16]

    Ekt: Exercise-aware knowledge tracing for student performance prediction

    Qi Liu, Zhenya Huang, Yu Yin, Enhong Chen, Hui Xiong, Yu Su, and Guoping Hu. Ekt: Exercise-aware knowledge tracing for student performance prediction. IEEE Transactions on Knowledge and Data Engineering, 33(1):100–115, 2019

  16. [17]

    Wizardcoder: Empowering code large language models with evol-instruct

    Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023

  17. [18]

    Edueval: A hierarchical cognitive benchmark for evaluating large language models in chinese education

    Guoqing Ma, Jia Zhu, Hanghui Guo, Weijie Shi, Yue Cui, Jiawei Shen, Zilong Li, and Yidan Liang. Edueval: A hierarchical cognitive benchmark for evaluating large language models in chinese education. arXiv preprint arXiv:2512.00290, 2025

  18. [19]

    Mineru2.5: A decoupled vision-language model for efficient high-resolution document parsing

    Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, et al. Mineru2.5: A decoupled vision-language model for efficient high-resolution document parsing. In The 64th Annual Meeting of the Association for Computational Linguistics, Industry Track, 2025

  19. [20]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  20. [21]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

  21. [22]

    Openhermes 2.5: An open dataset of synthetic data for generalist llm assistants

    Teknium. Openhermes 2.5: An open dataset of synthetic data for generalist llm assistants, 2023. URL https://huggingface.co/datasets/teknium/OpenHermes-2.5

  22. [23]

    Self-instruct: Aligning language models with self-generated instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 13484–13508, 2023

  23. [24]

    Chatie: Zero-shot information extraction via chatting with chatgpt

    Xiang Wei, Xingyu Cui, Ning Cheng, Xiaobin Wang, Xin Zhang, Shen Huang, Pengjun Xie, Jinan Xu, Yufeng Chen, Meishan Zhang, et al. Chatie: Zero-shot information extraction via chatting with chatgpt. arXiv preprint arXiv:2302.10205, 2023

  24. [25]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  25. [26]

    K-12edubench: A benchmark for evaluating large language models’ knowledge, problem-solving, and educational goal cognition in k-12 education

    Yuqing Ye, Xuan Zhou, Zhifu Chen, Dandan Li, Hengnian Gu, Jin Peng Zhou, and Dongdai Zhou. K-12edubench: A benchmark for evaluating large language models' knowledge, problem-solving, and educational goal cognition in k-12 education. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 34459–34466, 2026

  26. [27]

    Ck12: a rounded k12 knowledge graph based benchmark for chinese holistic cognition evaluation

    Weihao You, Pengcheng Wang, Changlong Li, Zhilong Ji, and Jinfeng Bai. Ck12: a rounded k12 knowledge graph based benchmark for chinese holistic cognition evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19431–19439, 2024

  27. [28]

    Retrieve anything to augment large language models

    Peitian Zhang, Shitao Xiao, Zheng Liu, Zhicheng Dou, and Jian-Yun Nie. Retrieve anything to augment large language models. arXiv preprint arXiv:2310.07554, 2023

  28. [29]

    Evaluating the performance of large language models on gaokao benchmark

    Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. Evaluating the performance of large language models on gaokao benchmark. arXiv preprint arXiv:2305.12474, 2023

  29. [30]

    Lmsys-chat-1m: A large-scale real-world llm conversation dataset

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P Xing, et al. Lmsys-chat-1m: A large-scale real-world llm conversation dataset. arXiv preprint arXiv:2309.11998, 2023

  30. [31]

    Llamafactory: Unified efficient fine-tuning of 100+ language models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 3: system demonstrations), pages 400–410, 2024

  31. [32]

    first appearance

    Candidate pooling: multi-layer expansion from graph neighborhoods to broader curriculum-based pools; 2. Rule-based filtering: removal of gold answers, surface-form duplicates, and task-invalid candidates; 3. Pedagogical filtering: LLM-based filtering to discard weak, trivial, or controversial distractors. Candidates are finally deduplicated and s...

  32. [33]

    • All content must be strictly grounded in the input properties; do not introduce external knowledge

    General Constraints • Each example must contain exactly one question and one answer. • All content must be strictly grounded in the input properties; do not introduce external knowledge. • Use aliases when appropriate to improve naturalness

  33. [34]

    Concept-Oriented QA • Questions should focus on: ■ definition (what it is), ■ key properties or formulas, ■ basic understanding. • Answers must follow the structure: \box{core definition} + \box{core formula / key rule} (optional) followed by: ■ 1–2 sentences of explanation (optional), ■ 1 illustrative example (optional)

  34. [35]

    • Answers must follow the structure: \box{method description or steps} followed by: ■ brief explanation of usage or advantages (optional), ■ 1 illustrative example (optional)

    Skill-Oriented QA • Questions should focus on: ■ what the method is, ■ how to apply it (steps or procedure), ■ when to use it (applicable scenarios). • Answers must follow the structure: \box{method description or steps} followed by: ■ brief explanation of usage or advantages (optional), ■ 1 illustrative example (optional). Style Requirements • Language must be cl...

  35. [36]

    • All content must be strictly grounded in the input properties; do not introduce external knowledge

    General Constraints • Each example must contain exactly one question and one answer. • All content must be strictly grounded in the input properties; do not introduce external knowledge. • Questions must focus on explaining why the edge holds, rather than merely restating the two endpoint names

  36. [37]

    Relation-Specific Prompting Goals • is_a: ask why {source_name} belongs to or is part of {target_name}; answers should explain how the source satisfies the defining properties of the target. • prerequisites_for: ask why one should learn {source_name} before {target_name}; answers should explain what knowledge or ability the source provides that supports late...

  37. [38]

    • prerequisites_for: because learning the target requires {specific knowledge/ability}, and the source provides this knowledge/ability, therefore the source should be learned first

    Preferred Answer Structures • is_a: because {source definition}, {target definition}, and {source satisfies target definition}, therefore {source belongs to target}. • prerequisites_for: because learning the target requires {specific knowledge/ability}, and the source provides this knowledge/ability, therefore the source should be learned first. • relates_...