Why Machines Misread Pedagogical Quality: Human-Machine Alignment in LLM-Based Pretest Question Evaluation

Mahir Akgun; Pei-Yu Tseng; Peng Liu

arxiv: 2606.23629 · v1 · pith:NLJOXTZVnew · submitted 2026-06-22 · 💻 cs.HC

Why Machines Misread Pedagogical Quality: Human-Machine Alignment in LLM-Based Pretest Question Evaluation

Pei-Yu Tseng , Mahir Akgun , Peng Liu This is my paper

Pith reviewed 2026-06-26 06:57 UTC · model grok-4.3

classification 💻 cs.HC

keywords human-machine alignmentLLM evaluationpretest questionsrubric operationalizationpedagogical qualityAI-assisted workfloweducational assessment

0 comments

The pith

Human and LLM judgments on pretest question quality disagree systematically, with rubric revisions improving alignment more than rationale-first prompting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines an AI workflow that generates pretest questions and uses LLMs to evaluate their pedagogical quality at scale. It runs a 2x2 experiment that changes how rubrics are written and whether evaluations require step-by-step rationales first. The central finding is that human-machine disagreements follow consistent patterns rather than occurring by chance. Rubric revision produces a bigger gain in agreement than the rationale change, and the two adjustments reinforce each other. This matters because any large-scale use of machine filtering for educational content will depend on making quality criteria legible to the model.

Core claim

Our findings show that human-machine disagreements are systematic rather than random, that rubric revision has a larger effect on alignment than rationale-first evaluation, and that the two interventions are complementary. These findings highlight that scalable AI-assisted pretesting depends not only on generation capability but on how pedagogical quality is operationalized for machine interpretation.

What carries the argument

A 2x2 experimental design varying rubric operationalization and evaluation mode to quantify human-LLM agreement on pretest question scores.

If this is right

Improving how rubrics describe quality criteria offers a practical lever for better machine filtering of educational questions.
Using both rubric revision and rationale prompts together yields higher alignment than applying either change alone.
Because disagreements are systematic, they can be addressed through targeted adjustments instead of generic model improvements.
The workflow can support larger volumes of pretest question development once machine scoring is brought closer to human standards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same operationalization issues may appear when LLMs score other educational artifacts such as lesson plans or student answers.
Testing the interventions across different subjects or question formats would show whether the relative importance of rubric revision holds more broadly.
If systematic biases can be reduced through rubric design, it could lower the amount of ongoing human oversight needed in AI education pipelines.

Load-bearing premise

The 2x2 design captures the main factors that drive alignment, and the chosen rubrics and questions are representative of typical pedagogical quality assessment.

What would settle it

A replication in which rubric revision no longer shows a larger alignment gain than rationale-first evaluation, or in which disagreements appear random rather than patterned.

Figures

Figures reproduced from arXiv: 2606.23629 by Mahir Akgun, Pei-Yu Tseng, Peng Liu.

**Figure 1.** Figure 1: Overview of the question generation workflow with prompt templates. The workflow consists of three components that operate iteratively (Figure 1). In each round, the Generator receives all four inputs and, after the first round, feedback from the Controller, and outputs a batch of candidate questions. The Evaluator first scores each question individually on all rubric dimensions, removing those below leve… view at source ↗

read the original abstract

Designing effective pretest questions is challenging at scale: high-quality questions require careful calibration of openness, cognitive depth, and alignment with learning objectives, yet generating and evaluating them manually is time-consuming. We present an AI-assisted workflow for pretest question development that combines automated generation, rubric-based evaluation, and iterative selection. Because the workflow relies on machine evaluation to filter questions at scale, we investigate the alignment between human and machine judgments across a 2x2 design varying rubric operationalization and evaluation mode. Our findings show that human-machine disagreements are systematic rather than random, that rubric revision has a larger effect on alignment than rationale-first evaluation, and that the two interventions are complementary. These findings highlight that scalable AI-assisted pretesting depends not only on generation capability but on how pedagogical quality is operationalized for machine interpretation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Rubric operationalization improves human-LLM alignment on pretest questions more than rationale-first mode, with the two changes working together.

read the letter

The main thing to take from this paper is that rubric revision has a bigger impact on human-LLM alignment than rationale-first evaluation when assessing pedagogical quality of pretest questions, and the two are complementary.

The paper runs a 2x2 experiment on this in the context of an AI workflow for generating and selecting pretest questions. It is new in applying alignment ideas specifically to this education task rather than general question answering or other domains. The work does well by showing that disagreements are systematic and by testing practical changes that could be implemented in tools. It frames the problem around scalable pretesting and points out that how quality is operationalized for the machine matters.

The soft spots are mostly around missing details. The abstract gives the findings but no sample sizes, no stats, no information on how many questions or raters were involved, and no description of the actual rubrics or questions used. This makes it hard to evaluate if the effects are large or if the setup is representative. The weakest assumption is that the 2x2 design covers the main factors affecting alignment and that their choices generalize. If the full paper fills in those gaps with solid data, the argument holds; otherwise the results stay hard to interpret.

This paper is for people working on AI-assisted education systems or human-AI collaboration in content evaluation. Someone looking for case studies of alignment experiments in a new applied area could find it relevant. It deserves serious peer review because it has a clear experimental question and reports new empirical results, even if the write-up needs more methodological transparency. Referees would likely ask for the numbers and checks on generalizability, which is normal for this kind of work.

Referee Report

1 major / 0 minor

Summary. The manuscript presents an AI-assisted workflow for pretest question development combining LLM generation, rubric-based evaluation, and iterative selection. It reports results from a 2x2 experimental design that varies rubric operationalization and evaluation mode (rationale-first vs. direct), finding that human-machine disagreements on pedagogical quality are systematic rather than random, that revising the rubric has a larger positive effect on alignment than rationale-first prompting, and that the two interventions are complementary.

Significance. If the empirical results are robust, the work provides actionable guidance on operationalizing pedagogical rubrics for LLMs, showing that alignment depends more on how quality criteria are defined than on the order of rationale generation. This could inform scalable educational content tools, though the absence of reported sample sizes, statistical tests, and selection criteria in the abstract limits immediate assessment of effect sizes and generalizability.

major comments (1)

[Abstract] Abstract: the central claims (systematic disagreements, larger effect of rubric revision, complementarity) are stated without any accompanying quantitative details such as number of questions, number of human evaluators, inter-rater reliability, or statistical tests. This absence makes it impossible to determine whether the data actually support the reported ordering of effects or the conclusion that disagreements are systematic.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment on the abstract below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claims (systematic disagreements, larger effect of rubric revision, complementarity) are stated without any accompanying quantitative details such as number of questions, number of human evaluators, inter-rater reliability, or statistical tests. This absence makes it impossible to determine whether the data actually support the reported ordering of effects or the conclusion that disagreements are systematic.

Authors: We agree with the referee that the abstract would benefit from including key quantitative details to allow readers to better assess the strength of our claims. The full manuscript contains these details (including the number of questions evaluated in the 2x2 design, the number of human evaluators, inter-rater reliability metrics, and the statistical tests used to compare effects). We will revise the abstract to incorporate these elements, such as sample sizes and a brief mention of the statistical findings supporting the ordering of effects. This revision will make the central claims more verifiable from the abstract alone. revision: yes

Circularity Check

0 steps flagged

Empirical study with no derivation chain or self-referential elements

full rationale

This paper reports results from a 2x2 experimental design comparing rubric operationalization and evaluation modes in an LLM-based pretest question evaluation workflow. The abstract and described structure contain no equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations that reduce claims to prior author work. Findings on systematic disagreements and intervention effects are presented as direct outcomes of the human-machine comparison experiment, with no mathematical or definitional reductions to the inputs. The analysis is therefore self-contained as standard empirical reporting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the experimental factors isolate effects on alignment and that the rubrics validly represent pedagogical quality for machine interpretation.

axioms (1)

domain assumption Rubric operationalization and evaluation mode are the primary controllable factors affecting human-LLM alignment in question evaluation
The study structures its investigation around a 2x2 design varying exactly these two elements.

pith-pipeline@v0.9.1-grok · 5672 in / 1153 out tokens · 23412 ms · 2026-06-26T06:57:33.957867+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 6 canonical work pages

[1]

and Rahman, Shuhebur and Perkins, Kyle , title =

Carpenter, Shana K. and Rahman, Shuhebur and Perkins, Kyle , title =. Journal of Experimental Psychology: Applied , year =. doi:10.1037/xap0000145 , pmid =

work page doi:10.1037/xap0000145
[2]

and Sana, Faria , title =

Pan, Steven C. and Sana, Faria , title =. Journal of Experimental Psychology: Applied , year =. doi:10.1037/xap0000345 , pmid =

work page doi:10.1037/xap0000345
[3]

and Kornell, Nate and Kao, Liche Sean , title =

Richland, Lindsey E. and Kornell, Nate and Kao, Liche Sean , title =. Journal of Experimental Psychology: Applied , year =. doi:10.1037/a0016496 , pmid =

work page doi:10.1037/a0016496
[4]

, title =

Sana, Faria and Carpenter, Shana K. , title =. Psychonomic Bulletin & Review , year =. doi:10.3758/s13423-023-02274-6 , pmid =

work page doi:10.3758/s13423-023-02274-6
[5]

and Sana, Faria and Samani, Joshua and Cooke, James and Kim, Joseph A

Pan, Steven C. and Sana, Faria and Samani, Joshua and Cooke, James and Kim, Joseph A. , title =. Memory , year =. doi:10.1080/09658211.2020.1815790 , pmid =

work page doi:10.1080/09658211.2020.1815790 2020
[6]

Journal of Educational Measurement , volume=

Construct equivalence of multiple-choice and constructed-response items: A random effects synthesis of correlations , author=. Journal of Educational Measurement , volume=. 2003 , publisher=

2003
[7]

and Rock, Donald A

Bennett, Randy Elliot and Ward, William C. and Rock, Donald A. and LaHart, Charles , title =. 1990 , type =. doi:10.1002/j.2333-8504.1990.tb01348.x , url =

work page doi:10.1002/j.2333-8504.1990.tb01348.x 1990
[8]

Research Monograph No

Criteria for Alignment of Expectations and Assessments in Mathematics and Science Education. Research Monograph No. 6. , author=. 1997 , publisher=

1997
[9]

International journal of artificial intelligence in education , volume=

A systematic review of automatic question generation for educational purposes , author=. International journal of artificial intelligence in education , volume=. 2020 , publisher=

2020
[10]

Theory into practice , volume=

A revision of Bloom's taxonomy: An overview , author=. Theory into practice , volume=. 2002 , publisher=

2002
[11]

Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang , booktitle=. G-Eval:
[12]

Findings of the Association for Computational Linguistics: EMNLP 2023 , year=

A Closer Look into Automatic Evaluation Using Large Language Models , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , year=

2023
[13]

Education and Information Technologies , volume=

Exploring quality criteria and evaluation methods in automated question generation: A comprehensive survey , author=. Education and Information Technologies , volume=

[1] [1]

and Rahman, Shuhebur and Perkins, Kyle , title =

Carpenter, Shana K. and Rahman, Shuhebur and Perkins, Kyle , title =. Journal of Experimental Psychology: Applied , year =. doi:10.1037/xap0000145 , pmid =

work page doi:10.1037/xap0000145

[2] [2]

and Sana, Faria , title =

Pan, Steven C. and Sana, Faria , title =. Journal of Experimental Psychology: Applied , year =. doi:10.1037/xap0000345 , pmid =

work page doi:10.1037/xap0000345

[3] [3]

and Kornell, Nate and Kao, Liche Sean , title =

Richland, Lindsey E. and Kornell, Nate and Kao, Liche Sean , title =. Journal of Experimental Psychology: Applied , year =. doi:10.1037/a0016496 , pmid =

work page doi:10.1037/a0016496

[4] [4]

, title =

Sana, Faria and Carpenter, Shana K. , title =. Psychonomic Bulletin & Review , year =. doi:10.3758/s13423-023-02274-6 , pmid =

work page doi:10.3758/s13423-023-02274-6

[5] [5]

and Sana, Faria and Samani, Joshua and Cooke, James and Kim, Joseph A

Pan, Steven C. and Sana, Faria and Samani, Joshua and Cooke, James and Kim, Joseph A. , title =. Memory , year =. doi:10.1080/09658211.2020.1815790 , pmid =

work page doi:10.1080/09658211.2020.1815790 2020

[6] [6]

Journal of Educational Measurement , volume=

Construct equivalence of multiple-choice and constructed-response items: A random effects synthesis of correlations , author=. Journal of Educational Measurement , volume=. 2003 , publisher=

2003

[7] [7]

and Rock, Donald A

Bennett, Randy Elliot and Ward, William C. and Rock, Donald A. and LaHart, Charles , title =. 1990 , type =. doi:10.1002/j.2333-8504.1990.tb01348.x , url =

work page doi:10.1002/j.2333-8504.1990.tb01348.x 1990

[8] [8]

Research Monograph No

Criteria for Alignment of Expectations and Assessments in Mathematics and Science Education. Research Monograph No. 6. , author=. 1997 , publisher=

1997

[9] [9]

International journal of artificial intelligence in education , volume=

A systematic review of automatic question generation for educational purposes , author=. International journal of artificial intelligence in education , volume=. 2020 , publisher=

2020

[10] [10]

Theory into practice , volume=

A revision of Bloom's taxonomy: An overview , author=. Theory into practice , volume=. 2002 , publisher=

2002

[11] [11]

Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang , booktitle=. G-Eval:

[12] [12]

Findings of the Association for Computational Linguistics: EMNLP 2023 , year=

A Closer Look into Automatic Evaluation Using Large Language Models , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , year=

2023

[13] [13]

Education and Information Technologies , volume=

Exploring quality criteria and evaluation methods in automated question generation: A comprehensive survey , author=. Education and Information Technologies , volume=