Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact
Pith reviewed 2026-05-15 18:41 UTC · model grok-4.3
The pith
LLMs share behavioral biases that align poorly with expert human teaching and can oppose intended student learning outcomes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across all LLMs, behaviors on disparate teaching tasks correlate more strongly between models than with expert human behaviors on the target tasks. These shared biases are poorly aligned with downstream measures of teaching quality and often negatively aligned with the intended impact, namely student learning outcomes. Selection of LLM or prompting strategy reliably accounts for only 15 percent of measured misalignment error, while variation in error is shared across models, suggesting that common pretraining drives much of the misalignment. Multi-model ensembles further exacerbate the divergence from learning goals.
What carries the argument
Alignment measurement between LLM outputs, expert human reference behaviors, and intended-impact metrics on teaching and learning tasks for schoolchildren.
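The comparison behind the core claim can be illustrated with a minimal sketch: compute the average correlation between pairs of models, the average correlation between each model and the expert reference, and the gap between the two. The function names, the use of Pearson correlation, and all ratings below are assumptions made for illustration; they are not the paper's data or its exact statistic.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def correlation_gap(model_scores, expert_scores):
    """Return (mean inter-model r, mean model-expert r, difference)."""
    inter = mean(
        pearson(model_scores[a], model_scores[b])
        for a, b in combinations(sorted(model_scores), 2)
    )
    to_expert = mean(pearson(s, expert_scores) for s in model_scores.values())
    return inter, to_expert, inter - to_expert


# Hypothetical 1-7 ratings of six transcripts (made-up numbers, not paper data).
model_scores = {
    "model_a": [5, 6, 5, 6, 4, 6],
    "model_b": [5, 6, 6, 6, 4, 5],
    "model_c": [4, 6, 5, 6, 4, 6],
}
expert_scores = [3, 2, 5, 4, 6, 2]

inter, to_expert, gap = correlation_gap(model_scores, expert_scores)
print(f"inter-model r={inter:.2f}, model-expert r={to_expert:.2f}, gap={gap:.2f}")
```

A positive gap in this toy setup corresponds to the pattern the review describes: models agree with one another more than with the expert reference. Whether the paper uses Pearson, a rank correlation, or another measure is not stated in this summary.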
If this is right
- LLM selection and prompting explain only a small fraction of misalignment, so changing them alone will not fix alignment with learning goals.
- Unanimous voting or benchmark-weighted ensembles of LLMs increase misalignment with intended student outcomes.
- Common pretraining data creates shared biases that diverge from expert teaching practices.
- Applications of LLMs in high-noise educational settings require direct measurement against intended impact rather than benchmark scores.
- Practical deployment in teaching needs methods that go beyond current model behaviors to reach intended learning results.
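To see why ensembling need not help when errors are shared, consider the following sketch. It implements one plausible reading of the two ensemble types named above: a benchmark-weighted average of ratings, and a unanimity filter that keeps items on which every model votes the same way. The threshold, the weights, and all ratings are hypothetical; the paper's exact ensembling procedure may differ.

```python
def weighted_ensemble(ratings_by_model, weights):
    """Per-item weighted mean of model ratings (weights need not sum to 1)."""
    total = sum(weights[m] for m in ratings_by_model)
    n_items = len(next(iter(ratings_by_model.values())))
    return [
        sum(weights[m] * r[i] for m, r in ratings_by_model.items()) / total
        for i in range(n_items)
    ]


def unanimous_items(ratings_by_model, threshold):
    """Indices where all models fall on the same side of the threshold."""
    n_items = len(next(iter(ratings_by_model.values())))
    keep = []
    for i in range(n_items):
        votes = {r[i] >= threshold for r in ratings_by_model.values()}
        if len(votes) == 1:  # every model cast the same vote
            keep.append(i)
    return keep


def mean_abs_error(pred, target):
    """Mean absolute difference between predictions and the reference."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


# Made-up ratings: every model sits one to three points above the expert.
expert = [3, 4, 2, 5]
ratings = {
    "model_a": [5, 6, 4, 6],
    "model_b": [5, 5, 4, 7],
    "model_c": [4, 6, 5, 7],
}
benchmark = {"model_a": 0.9, "model_b": 0.8, "model_c": 0.7}  # hypothetical scores

ensemble = weighted_ensemble(ratings, benchmark)
single_errors = {m: mean_abs_error(r, expert) for m, r in ratings.items()}
ensemble_error = mean_abs_error(ensemble, expert)
agreed = unanimous_items(ratings, threshold=5)

print("single-model MAE:", single_errors)
print("benchmark-weighted ensemble MAE:", round(ensemble_error, 2))
print("items with unanimous votes:", agreed)
```

Averaging cancels independent noise but leaves a bias that all models share, so in this toy data the ensemble is no closer to the expert than the best single model. The unanimity filter likewise selects items where the models agree with each other, which says nothing about whether they agree with the expert.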
Where Pith is reading between the lines
- The pattern suggests that simply scaling model size or data volume without changing pretraining objectives may widen the gap between model behavior and real-world human goals.
- Similar misalignment risks could appear in other domains where intended impact is hard to verify, such as medical advice or policy drafting.
- Targeted fine-tuning on expert outcome data rather than benchmarks might reduce the shared biases observed here.
- Developers could test whether altering pretraining mixtures reduces the inter-model correlation excess on human-centric tasks.
Load-bearing premise
The chosen teaching and learning tasks for schoolchildren accurately capture the intended impacts on student outcomes, and expert human behaviors provide the right reference standard for alignment.
What would settle it
An experiment that directly measures student learning gains when using LLM-assisted teaching and finds positive outcomes despite the reported misalignment patterns.
Original abstract
LLMs increasingly excel on AI benchmarks, but doing so does not guarantee validity for downstream tasks. This study contrasts LLM alignment on benchmarks, downstream tasks, and, importantly, the intended impact of those tasks. We evaluate the performance of leading LLMs (i.e., generative pre-trained base models) on difficult-to-verify tasks of the teaching and learning of schoolchildren. Across all LLMs, inter-model behaviors on disparate tasks correlate higher than they do with expert human behaviors on target tasks. These biases shared across LLMs are poorly aligned with downstream measures of teaching quality and often negatively aligned with the intended impact of student learning outcomes. Further, we find multi-model ensembles, both unanimous model voting and expert-weighting by benchmark performance, further exacerbate misalignment with learning. We measure that selection of LLM and/or prompting strategy only reliably accounts for $15\%$ of all measured misalignment error and that variation in misalignment error is shared across LLMs, suggesting that common pretraining accounts for much of the misalignment in these tasks. We demonstrate methods for robustly measuring alignment of complex tasks and provide unique insights into practical applications of LLMs in high-noise contexts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates leading LLMs on difficult-to-verify teaching and learning tasks for schoolchildren. It reports that inter-model behavioral correlations across disparate tasks exceed correlations with expert human behaviors on the same tasks. These shared LLM biases are described as poorly aligned with downstream teaching-quality metrics and often negatively aligned with intended student learning outcomes. Ensembles (unanimous voting or benchmark-weighted) are shown to worsen misalignment. The authors attribute most misalignment variance to common pretraining rather than model or prompt choice (claiming the latter accounts for only 15% of error) and present methods for measuring alignment in noisy, complex tasks.
Significance. If the empirical patterns hold after proper validation, the work would usefully document that benchmark-optimized LLMs can systematically diverge from pedagogically desirable behaviors even when surface performance appears strong. The emphasis on intended downstream impact rather than benchmark accuracy alone is a constructive framing for high-stakes deployment questions.
major comments (3)
- [Abstract / Methods] Abstract and methods: The central claim that LLM biases are 'often negatively aligned with the intended impact of student learning outcomes' rests on proxy tasks whose validity against actual learning gains is not demonstrated. No regression, controlled trial, or correlation with measurable retention or skill acquisition is reported to link the chosen teaching-quality metrics to real student outcomes.
- [Results] Results section: The statement that 'selection of LLM and/or prompting strategy only reliably accounts for 15% of all measured misalignment error' requires an explicit variance-decomposition procedure (e.g., ANOVA, hierarchical model, or ablation across models/prompts). Without the statistical test, error bars, or exclusion criteria, it is impossible to assess whether the 15% figure is robust or an artifact of the chosen tasks.
- [Discussion] Discussion: The attribution of shared misalignment primarily to 'common pretraining' is plausible but currently circular; the paper compares LLMs to human experts rather than to models with controlled pretraining differences. A direct test (e.g., comparing base vs. instruction-tuned variants or models trained on different corpora) is needed to support the causal claim.
minor comments (2)
- [Abstract] The abstract would benefit from a concise operational definition of 'intended impact' and 'alignment' before stating the negative-alignment result.
- [Figures/Tables] Figure and table captions should explicitly state the number of models, tasks, and human raters used in each correlation analysis.
Simulated Authors' Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below with point-by-point responses. Revisions have been made to the manuscript where they strengthen clarity or address valid concerns without altering the core empirical findings.
- Referee: [Abstract / Methods] Abstract and methods: The central claim that LLM biases are 'often negatively aligned with the intended impact of student learning outcomes' rests on proxy tasks whose validity against actual learning gains is not demonstrated. No regression, controlled trial, or correlation with measurable retention or skill acquisition is reported to link the chosen teaching-quality metrics to real student outcomes.
  Authors: We agree that direct validation against measurable student outcomes (e.g., retention or skill acquisition in controlled trials) would provide stronger grounding. The manuscript intentionally focuses on alignment with expert-defined intended impacts rather than post-hoc measured gains, as the latter would require longitudinal classroom studies outside the paper's scope. The proxy tasks are drawn from standard curricula with explicit learning objectives, and negative correlations are reported against these intended outcomes. We have revised the abstract and methods to explicitly label the metrics as proxies and added a limitations subsection discussing the lack of direct outcome correlations.
  revision: partial
- Referee: [Results] Results section: The statement that 'selection of LLM and/or prompting strategy only reliably accounts for 15% of all measured misalignment error' requires an explicit variance-decomposition procedure (e.g., ANOVA, hierarchical model, or ablation across models/prompts). Without the statistical test, error bars, or exclusion criteria, it is impossible to assess whether the 15% figure is robust or an artifact of the chosen tasks.
  Authors: The 15% figure originates from an ablation study that systematically varied models and prompts while holding tasks fixed and measuring the resulting change in misalignment scores. To make this fully transparent, we have added an explicit variance decomposition using a linear mixed-effects model (with model and prompt as fixed effects and task as a random effect) to the revised results section, including the full ANOVA table, standard errors, and task exclusion criteria based on inter-rater reliability thresholds.
  revision: yes
- Referee: [Discussion] Discussion: The attribution of shared misalignment primarily to 'common pretraining' is plausible but currently circular; the paper compares LLMs to human experts rather than to models with controlled pretraining differences. A direct test (e.g., comparing base vs. instruction-tuned variants or models trained on different corpora) is needed to support the causal claim.
  Authors: The attribution rests on the observation that misalignment variance is highly shared across models despite architectural and fine-tuning differences, while model/prompt choice explains only 15%. This pattern is consistent with common pretraining data as the dominant source. We have expanded the discussion to articulate this reasoning more explicitly, cite supporting literature on pretraining corpus effects, and acknowledge that a controlled base-versus-tuned comparison would offer stronger causal evidence. Such an experiment is noted as valuable future work.
  revision: partial
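The kind of variance decomposition the referee asks for can be sketched in a few lines. The example below is a deliberately simpler stand-in for the linear mixed-effects model mentioned in the rebuttal: it computes eta squared, the between-group sum of squares divided by the total sum of squares, once grouping by model/prompt condition and once grouping by task. The record layout, the names, and the error values are all hypothetical, and the output is not meant to reproduce the 15% figure.

```python
from collections import defaultdict


def variance_share(records, key):
    """Share of total variance in `error` explained by grouping on `key` (eta squared)."""
    errors = [r["error"] for r in records]
    grand = sum(errors) / len(errors)
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r["error"])
    ss_total = sum((e - grand) ** 2 for e in errors)
    ss_between = sum(
        len(v) * (sum(v) / len(v) - grand) ** 2 for v in groups.values()
    )
    return ss_between / ss_total


# Made-up misalignment errors: tasks differ a lot, model/prompt conditions only slightly.
task_base = {"task_1": 1.0, "task_2": 2.0, "task_3": 3.0}
condition_shift = {"model_a/prompt_1": 0.0, "model_b/prompt_2": 0.2}
records = [
    {"task": t, "condition": c, "error": base + shift}
    for t, base in task_base.items()
    for c, shift in condition_shift.items()
]

condition_share = variance_share(records, "condition")
task_share = variance_share(records, "task")
print(f"condition share={condition_share:.3f}, task share={task_share:.3f}")
```

Because this toy design is balanced and purely additive, the two shares sum to one. With real data there would also be interaction and residual terms, which is one reason a mixed-effects model with reported standard errors is the more defensible choice.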
Circularity Check
No significant circularity; alignment measured via external expert human behaviors and downstream outcome proxies
The paper derives its misalignment claims through empirical correlations between LLM outputs on teaching tasks and independent reference standards consisting of expert human behaviors plus downstream measures of teaching quality and student learning outcomes. These references are external to the LLMs and not constructed from the models' fitted parameters, self-defined metrics, or prior self-citations. No load-bearing step reduces by definition or construction to the inputs (e.g., no inter-model correlation is fitted and then relabeled as a prediction of misalignment). The analysis therefore remains self-contained against verifiable external benchmarks rather than circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Expert human behaviors on teaching tasks represent the appropriate baseline for alignment with intended student learning outcomes.
Reference graph
Works this paper leans on
[16]
Think step-by-step how you would rate the instructional dialogue of the teacher on a scale of 1-7 (low-high). Instructional dialogue captures the purposeful use of content-focused discussion among teachers and students that is cumulative, with the teacher supporting students to chain ideas together in ways that lead to deeper understanding of content. Stu...
-
[17]
Provide your rating as a number between 1 and 7. Format your answer as: Reasoning: Rating (only specify a number between 1-7): Reasoning:
-
[18]
This results in 103,148 total observations across models, tasks and prompts. Additionally, while not the focus of this study, we replicated the SOTA models of (Hardy, 2025b) to have confidence that the misalignment we observed were not the result of an impossible task using only transcripts. These encoders and those from (Hardy, 2025b) are shown as baseli...
-
[19]
This suggests that the estimated item-transcript scores in this study achieve approximately the target level of consistency expected for these types of data (0.65, see Ho and Kane 2013).
C.4.3 Interpretation
These results imply that prompt-induced shifts are largely additive and limited in magnitude rather than transformative. While unexplored prompts m...
-
[20]
when performing this transformation. We report the results without this transformation both for simplicity and to better preserve the alignment nature of τ.
D.3 Expert Ensembling
Conventional wisdom suggests that ensembling multiple models improves robustness and accuracy by leveraging diverse model strengths or averaging out independent errors. Our findin...