
arxiv: 2606.18257 · v1 · pith:HCFWUYJNnew · submitted 2026-05-06 · 💻 cs.HC · cs.AI

From Memorization to Creation: Evaluating the Cognitive Depth of LLM-Generated Educational Questions

Pith reviewed 2026-06-30 23:48 UTC · model grok-4.3

classification 💻 cs.HC cs.AI
keywords LLM-generated questions · Bloom's Taxonomy · cognitive levels · prompting strategy · educational content creation · higher-order thinking · repetitiveness reduction · CogShift metric

The pith

A fine-grained prompting strategy reduces repetitiveness in LLM-generated educational questions by 24.45 percent (Qwen2.5-7B-Instruct) and raises the share of higher-order cognitive outputs by 11.53 percent (InternLM3-8B-Instruct).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can move beyond rote-memorization questions to ones that require analysis, evaluation, and creation. It generates 20,700 questions with six models across computer science, K-12 math, and social science, and classifies the outputs by Bloom's Taxonomy level. A fine-grained prompting method cuts repeated questions and increases the share of higher-level items, while new metrics track the intensity of shifts between cognitive categories. One model, InternLM3, stands out for making multi-level jumps. The results supply concrete benchmarks for using LLMs to support personalized learning that targets deeper understanding.
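To make "repeated questions" concrete, here is a minimal sketch of one way a repetition rate could be computed. It is not the paper's measure, which this page does not describe: the token-set Jaccard similarity, the 0.8 threshold, and the names `tokens`, `jaccard`, and `repetition_rate` are all assumptions made for illustration, and the sample questions are invented.

```python
def tokens(question: str) -> frozenset[str]:
    """Lowercased word set with surrounding punctuation stripped."""
    words = (w.strip(".,;:!?()\"'") for w in question.lower().split())
    return frozenset(w for w in words if w)


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def repetition_rate(questions: list[str], threshold: float = 0.8) -> float:
    """Share of questions that near-duplicate an earlier question in the list.

    Assumption: a question counts as a repeat when its token-set Jaccard
    similarity to any earlier question reaches the (hypothetical) threshold.
    """
    seen: list[frozenset[str]] = []
    repeats = 0
    for question in questions:
        current = tokens(question)
        if any(jaccard(current, earlier) >= threshold for earlier in seen):
            repeats += 1
        seen.append(current)
    return repeats / len(questions) if questions else 0.0


# Invented sample outputs under two prompting conditions.
baseline = [
    "What is a stack?",
    "What is a stack",
    "Define a queue.",
    "what is a stack?",
]
fine_grained = [
    "What is a stack?",
    "Compare a stack with a queue.",
    "Design a cache using a stack.",
    "Define a queue.",
]
baseline_rate = repetition_rate(baseline)
fine_grained_rate = repetition_rate(fine_grained)
```

A token-set comparison like this only catches surface-level repeats; paraphrased duplicates would need an embedding-based or human check.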

Core claim

Fine-grained prompting strategies enable LLMs to produce educational questions with measurably greater cognitive depth according to Bloom's Taxonomy. The strategy cuts repetitiveness by 24.45 percent in Qwen2.5-7B-Instruct and lifts the proportion of higher-order outputs by 11.53 percent in InternLM3-8B-Instruct. InternLM3 also shows the strongest performance on multi-level cognitive transitions, as quantified by the CogShift intensity and category-drift metrics.
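The "proportion of higher-order outputs" can be illustrated with a short sketch. It assumes the revised Bloom's Taxonomy level names and treats the top three levels as higher-order; the page does not say how the paper draws that line, nor whether the 11.53 percent figure is an absolute or a relative change. The labels below are invented.

```python
from collections import Counter

# Revised Bloom's Taxonomy levels, lowest to highest (assumed label names).
BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]
# Assumption: "higher-order" means the top three levels.
HIGHER_ORDER = {"Analyze", "Evaluate", "Create"}


def higher_order_share(labels: list[str]) -> float:
    """Fraction of labeled questions that fall in a higher-order level."""
    counts = Counter(labels)
    unknown = set(counts) - set(BLOOM_LEVELS)
    if unknown:
        raise ValueError(f"unknown Bloom labels: {sorted(unknown)}")
    total = sum(counts.values())
    return sum(counts[level] for level in HIGHER_ORDER) / total if total else 0.0


# Invented label sets for one model under two prompting conditions.
baseline_labels = ["Remember"] * 5 + ["Understand"] * 3 + ["Analyze"] * 2
prompted_labels = (
    ["Remember"] * 3 + ["Understand"] * 3 + ["Analyze"] * 2 + ["Evaluate"] + ["Create"]
)

# Absolute change in percentage points between the two conditions.
gain_points = 100 * (
    higher_order_share(prompted_labels) - higher_order_share(baseline_labels)
)
```

Reporting the change in percentage points, as here, avoids the ambiguity between absolute and relative gains.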

What carries the argument

The fine-grained prompting strategy that directs models toward explicit Bloom's Taxonomy levels, paired with the hybrid human-AI evaluation protocol and the CogShift metric that quantifies cognitive shift intensity.
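The CogShift formula is not given on this page, so the following is only a hypothetical sketch of what a shift-intensity summary could look like, not the paper's definition. It assumes each generated question is paired with a source Bloom level, ranks the revised-taxonomy levels 0 to 5, and counts a jump of two or more levels as "multi-level"; the function names and the sample pairs are invented.

```python
# Revised Bloom's Taxonomy levels, lowest to highest (assumed label names).
BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]
RANK = {name: index for index, name in enumerate(BLOOM_LEVELS)}


def level_shift(source: str, generated: str) -> int:
    """Signed number of Bloom levels from the source level to the generated one."""
    return RANK[generated] - RANK[source]


def shift_summary(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Summarize shifts over (source_level, generated_level) pairs.

    Hypothetical statistics, not the paper's CogShift:
      mean_shift        - average signed shift in levels
      multi_level_share - fraction of pairs that rise by two or more levels
      downward_share    - fraction of pairs that fall to a lower level
    """
    if not pairs:
        raise ValueError("need at least one (source, generated) pair")
    shifts = [level_shift(source, generated) for source, generated in pairs]
    n = len(shifts)
    return {
        "mean_shift": sum(shifts) / n,
        "multi_level_share": sum(1 for d in shifts if d >= 2) / n,
        "downward_share": sum(1 for d in shifts if d < 0) / n,
    }


# Invented pairs: shifts of +3, +1, 0, and -2 levels.
pairs = [
    ("Remember", "Analyze"),
    ("Understand", "Apply"),
    ("Apply", "Apply"),
    ("Analyze", "Understand"),
]
summary = shift_summary(pairs)
```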

If this is right

  • LLMs become more usable for creating questions that target analysis, evaluation, and creation rather than recall.
  • InternLM3 demonstrates stronger capacity for transitioning across multiple Bloom's levels than the other five models tested.
  • The interpretability analysis surfaces correlations among evaluation metrics that make Chain-of-Thought prompting more transparent.
  • The reported benchmarks can guide deployment of LLMs inside personalized learning systems that aim to build higher-order skills.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompting approach could be tested in additional subject areas such as history or biology to check whether the gains generalize.
  • If the CogShift metric proves stable, it might serve as an automatic filter for selecting which LLM outputs to use in live tutoring systems.
  • Improved cognitive depth in questions might reduce student reliance on surface-level study habits, though that outcome would need separate measurement.

Load-bearing premise

That the hybrid human-AI evaluation protocol assigns Bloom's Taxonomy levels to generated questions reliably and without bias, that is, without significant inter-rater variability or bias from the evaluating model.

What would settle it

A fresh set of 1,000 generated questions re-scored by a new panel of human raters who were not involved in the original protocol yields cognitive-level distributions that differ by more than 15 percent from the reported hybrid scores.
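The re-scoring check can be sketched as a comparison of two label distributions. This is an illustration under stated assumptions: "differ by more than 15 percent" is read here as a gap of more than 15 percentage points in the share of any single Bloom level, the level names follow the revised taxonomy, and the label sets are invented and far smaller than 1,000.

```python
from collections import Counter

# Revised Bloom's Taxonomy levels, lowest to highest (assumed label names).
BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]


def level_distribution(labels: list[str]) -> dict[str, float]:
    """Share of labels at each Bloom level (levels with no labels get 0.0)."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    return {level: counts[level] / len(labels) for level in BLOOM_LEVELS}


def max_gap_points(dist_a: dict[str, float], dist_b: dict[str, float]) -> float:
    """Largest per-level difference between two distributions, in percentage points."""
    return 100 * max(abs(dist_a[level] - dist_b[level]) for level in BLOOM_LEVELS)


# Invented labels: original hybrid scores versus a fresh human panel.
hybrid_labels = ["Remember"] * 4 + ["Understand"] * 3 + ["Analyze"] * 3
rescored_labels = ["Remember"] * 6 + ["Understand"] * 3 + ["Analyze"] * 1

gap = max_gap_points(
    level_distribution(hybrid_labels), level_distribution(rescored_labels)
)
# Assumption: the 15 percent threshold is applied to the largest per-level gap.
exceeds_threshold = gap > 15.0
```

Other readings of the threshold are possible, for example total variation distance across all levels, and would give a stricter or looser test.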

Figures

Figures reproduced from arXiv: 2606.18257 by Chaoli Zhang, Qingsong Wen, Song Lai, Xiaolong Wang, Ye Wei, Yu Tong, Zhe Zhao, Zijie Geng.

Figure 1: An illustration of current limitations in LLM
Figure 2: Cognitive level transitions of LLM-generated questions under CoT and FGP. Each subplot visualizes the Bloom-level
Figure 3: Mean shift score for cognitive leap, drift, and regression across LLMs under CoT and FGP.
Figure 5: Correlations among evaluation metrics. (a) Bloom-level
Figure 6: Classification performance of different models across Bloom's taxonomy categories. The "Application" (Ap) category
Figure 7: Prompt templates for guiding LLMs to generate and evaluate educational questions in computer science and K-12
Original abstract

While LLMs show promise in automating educational content creation, their ability to generate questions that stimulate higher-order thinking remains understudied. This work evaluates six widely-used LLMs through a Bloom's Taxonomy lens, focusing on their capacity to transcend rote memorization and achieve cognitive leaps. Using a hybrid human-AI evaluation protocol, we generate and analyze 20,700 questions across computer science, K-12 math, and social-science domains. Key contributions include: (1) a fine-grained prompting strategy that reduces question repetitiveness by 24.45% for Qwen2.5-7B-Instruct, and increases the proportion of higher-order cognitive level outputs by 11.53% for InternLM3-8B-Instruct; (2) quantitative metrics for cognitive shift intensity (CogShift) and category drift, revealing InternLM3's superior performance in multi-level transitions; (3) an interpretability analysis revealing metric-level correlations that enhance the transparency of Chain-of-Thought prompting. Our findings highlight the importance of cognitive-aware prompt design and provide benchmarks for deploying LLMs in personalized learning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper evaluates six LLMs on their ability to generate educational questions that reach higher levels of Bloom's Taxonomy rather than rote memorization. It generates and analyzes 20,700 questions across computer science, K-12 math, and social-science domains via a hybrid human-AI protocol, introduces a fine-grained prompting strategy claimed to reduce repetitiveness by 24.45% (Qwen2.5-7B-Instruct) and raise higher-order outputs by 11.53% (InternLM3-8B-Instruct), defines CogShift and category-drift metrics, and reports an interpretability analysis of Chain-of-Thought prompting.

Significance. If the labeling protocol proves reliable, the work supplies concrete benchmarks and a new metric (CogShift) for measuring cognitive depth in LLM-generated educational content. These could usefully inform prompt design for personalized learning tools and provide comparative data across model families.

major comments (2)
  1. [Abstract] Abstract: the hybrid human-AI evaluation protocol is described only at the level of 'human-AI' with no inter-rater reliability statistics (Cohen/Fleiss kappa or percent agreement) among humans, no human-AI agreement rates, and no blinded validation set. All headline percentages (24.45% repetitiveness reduction, 11.53% higher-order increase) rest on Bloom's Taxonomy assignments to the 20,700 questions; without these metrics the deltas are uninterpretable.
  2. [Results section] Results (CogShift and multi-level transitions): the claim of InternLM3's superior performance in multi-level transitions is presented without the explicit formula for CogShift, without error bars or statistical significance tests on the category-drift comparisons, and without an ablation isolating which prompting components produce the reported shifts.
minor comments (1)
  1. [Abstract] The abstract mentions 'metric-level correlations' in the interpretability analysis but does not name the metrics or report the correlation coefficients; these should be stated explicitly.
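For readers unfamiliar with the reliability statistic the first major comment asks for, here is a minimal sketch of Cohen's kappa for two raters, written from the standard definition (observed agreement corrected for chance agreement). The rater labels are invented, and nothing here reflects agreement figures from the paper, which reports none on this page.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented labels for ten questions: a human rater and an LLM rater.
human = ["Remember"] * 4 + ["Understand"] * 3 + ["Analyze"] * 3
model = ["Remember"] * 3 + ["Understand"] * 3 + ["Analyze"] * 4
kappa = cohens_kappa(human, model)
```

In this invented example the raters agree on 8 of 10 items, and chance agreement is 0.33, so kappa is (0.80 - 0.33) / (1 - 0.33). With more than two raters, Fleiss' kappa would be the analogous statistic.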

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the evaluation protocol and metric reporting. We address each major comment below and will revise the manuscript accordingly to enhance transparency and rigor.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the hybrid human-AI evaluation protocol is described only at the level of 'human-AI' with no inter-rater reliability statistics (Cohen/Fleiss kappa or percent agreement) among humans, no human-AI agreement rates, and no blinded validation set. All headline percentages (24.45% repetitiveness reduction, 11.53% higher-order increase) rest on Bloom's Taxonomy assignments to the 20,700 questions; without these metrics the deltas are uninterpretable.

    Authors: We agree that the abstract provides only a high-level description. The full manuscript (Section 3.2) details the hybrid protocol but omits reliability statistics. We will add Cohen's kappa for human raters, human-AI agreement rates on a held-out validation set, and clarify the blinded labeling procedure. This will directly support the reported deltas.

    Revision: yes

  2. Referee: [Results section] Results (CogShift and multi-level transitions): the claim of InternLM3's superior performance in multi-level transitions is presented without the explicit formula for CogShift, without error bars or statistical significance tests on the category-drift comparisons, and without an ablation isolating which prompting components produce the reported shifts.

    Authors: The CogShift definition appears in Section 4.1, but we acknowledge it lacks full explicitness and supporting statistics. We will insert the complete formula, add error bars with significance tests (e.g., paired t-tests or Wilcoxon) for category-drift results, and expand Section 5 with a targeted ablation isolating prompting components. These additions address the concerns without altering core claims.

    Revision: yes
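As an illustration of the kind of paired significance test the second response promises, the sketch below runs an exact paired sign-flip permutation test. This is a swapped-in technique, chosen to stay dependency-free; it is not the paired t-test or Wilcoxon test the simulated authors name, and the scores are invented rather than taken from the paper.

```python
from itertools import product


def paired_permutation_p(before: list[float], after: list[float]) -> float:
    """Two-sided exact sign-flip permutation test on paired differences.

    Under the null hypothesis the sign of each paired difference is arbitrary,
    so every sign assignment is enumerated (2**n of them; keep n small).
    """
    if len(before) != len(after) or not before:
        raise ValueError("need two equal-length, non-empty samples")
    diffs = [a - b for a, b in zip(after, before)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        statistic = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against float rounding in the comparison.
        if statistic >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Invented per-domain scores for one model under two prompting conditions.
cot_scores = [20, 25, 18, 30, 22, 27]
fgp_scores = [31, 33, 24, 41, 30, 35]
p_value = paired_permutation_p(cot_scores, fgp_scores)
```

Because every invented difference is positive, only the all-positive and all-negative sign assignments are as extreme as the observed sum, which gives the smallest p-value six pairs can produce. For larger samples, a sampled permutation test or a library implementation of the Wilcoxon signed-rank test would be more practical than full enumeration.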

Circularity Check

0 steps flagged

No circularity: empirical counts from labeled generations


The paper reports experimental results from generating 20,700 questions with six LLMs under different prompting strategies, then labeling them via a hybrid human-AI protocol against Bloom's Taxonomy levels. Headline deltas (24.45% repetitiveness reduction, 11.53% higher-order increase) and new metrics (CogShift, category drift) are obtained by direct counting and comparison of those labels; no equations, fitted parameters, or derivations are presented that reduce by construction to the inputs. No load-bearing self-citations, uniqueness claims, or ansatzes appear. The work is self-contained as an empirical benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; no specific free parameters, axioms, or invented entities are detailed beyond reliance on Bloom's Taxonomy as a standard cognitive framework.


