pith. machine review for the scientific record. sign in

arxiv: 2511.17069 · v3 · submitted 2025-11-21 · 💻 cs.CL

Recognition: 1 theorem link

· Lean Theorem

Interpretability from the Ground Up: Stakeholder-Centric Design of Automated Scoring in Educational Assessments

Authors on Pith no claims yet

Pith reviewed 2026-05-17 20:59 UTC · model grok-4.3

classification 💻 cs.CL
keywords automated scoringinterpretabilityeducational assessmentconstructed responsestakeholder analysisFGTI principlesAnalyticScorehuman alignment
0
0 comments X

The pith

Stakeholder analysis yields four principles that guide an automated scoring system to near state-of-the-art accuracy while remaining explainable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper begins by mapping the distinct transparency needs of students, teachers, administrators, and test developers in large-scale educational assessments. From those needs it derives four guiding principles for interpretability: faithfulness to the scoring criteria, groundedness in observable response evidence, traceability of each decision step, and interchangeability with human reasoning processes. It then constructs AnalyticScore as a working implementation of those principles for text-based constructed-response items. On ten items drawn from the ASAP-SAS dataset the resulting system exceeds the accuracy of several existing interpretable baselines and lands within an average of 0.06 quadratic weighted kappa of the strongest uninterpretable models. Its extracted features also match the choices made by human annotators performing the identical task.

Core claim

AnalyticScore applies the FGTI principles through explicit feature extraction and traceable scoring steps to produce grades for student open-ended text responses. Across the ten ASAP-SAS items it surpasses many prior interpretable methods in accuracy while remaining within 0.06 QWK, on average, of the current uninterpretable state of the art. Its feature selections align closely with those of human raters given the same featurization instructions.

What carries the argument

AnalyticScore, a reference framework that encodes the four FGTI principles through human-aligned, explicitly extracted features and step-wise traceable scoring.

If this is right

  • Automated scoring systems can satisfy transparency demands from multiple stakeholder groups without large accuracy penalties.
  • Feature choices that match human raters increase the chance that educators will understand and accept the scores.
  • The same four principles can serve as a design template for interpretable AI tools in other parts of education.
  • Traceable outputs let stakeholders identify exactly which parts of a response drove a given score.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the FGTI principles prove robust, the same stakeholder-first method could be applied to other high-stakes AI decisions such as admissions or hiring tools.
  • Real classroom pilots would be required to check whether the measured human alignment produces measurable gains in trust and fairness perceptions.
  • Extending the framework to handle spoken responses or multimodal submissions would test how well the principles travel beyond text.

Load-bearing premise

The four FGTI principles derived from stakeholder interviews are sufficient to satisfy the real interpretability requirements of every group that uses large-scale assessments.

What would settle it

A deployment study in which actual teachers or administrators report that AnalyticScore outputs still prevent them from explaining or contesting individual student scores as readily as they can with current human-scored rubrics.

Figures

Figures reproduced from arXiv: 2511.17069 by Candace Thille, Chris Piech, Joseph Tey, Mike Hardy, Yunsung Kim.

Figure 1
Figure 1. Figure 1: Schematic of the ANALYTICSCORE framework. The example question is: “Explain how pandas in China and koalas in Australia are similar, and how they both are different from pythons.” Principle 3 (Traceable). The scoring model should consist of subroutines that each represent a specific, well-defined evidentiary reasoning step on clearly specified inputs. Principle 4 (Interchangeable). A human should be able t… view at source ↗
Figure 2
Figure 2. Figure 2: shows the exact prompts used to implement the feature labeling function from Section 3.2. Students were asked the following question: ```[QUESTION PROMPT]``` Here are several examples of student responses to the question: Student Response: [RESPONSE TEXT] x 1000 Please tell me 15 short, simple, and representative statement, claims, or arguments that are common across many student responses and that disting… view at source ↗
Figure 3
Figure 3. Figure 3: Welcome page of one of the 3 Qualtrics forms used for the featurization alignment [PITH_FULL_IMAGE:figures/full_fig_p019_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Example description of the assessment item and assessment instruction shown to the [PITH_FULL_IMAGE:figures/full_fig_p020_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Each annotation task presented annotators with a (response, analytic component) pair and [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗
read the original abstract

AI-driven automated scoring systems offer scalable and efficient means of evaluating complex student-generated responses. Yet, despite increasing demand for transparency and interpretability, the field has yet to develop a widely accepted solution for interpretable automated scoring to be used in large-scale real-world assessments. This work takes a principled approach to address this challenge. We analyze the needs and potential benefits of interpretable automated scoring for various assessment stakeholder groups and develop four principles of interpretability -- (F)aithfulness, (G)roundedness, (T)raceability, and (I)nterchangeability (FGTI) -- targeted at those needs. To illustrate the feasibility of implementing these principles, we develop the AnalyticScore framework as a reference framework. When applied to the domain of text-based constructed-response scoring, AnalyticScore outperforms many uninterpretable scoring methods in terms of scoring accuracy and is, on average, within 0.06 QWK of the uninterpretable SOTA across 10 items from the ASAP-SAS dataset. By comparing against human annotators conducting the same featurization task, we further demonstrate that the featurization behavior of AnalyticScore aligns well with that of humans.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to address the lack of widely accepted interpretable automated scoring for large-scale educational assessments by first analyzing stakeholder needs and deriving four principles of interpretability—Faithfulness, Groundedness, Traceability, and Interchangeability (FGTI)—then illustrating their feasibility via the AnalyticScore reference framework. When applied to text-based constructed-response scoring, AnalyticScore is reported to outperform many uninterpretable methods and to stay within 0.06 QWK of uninterpretable SOTA on average across 10 ASAP-SAS items, while its featurization behavior aligns with that of human annotators on the same task.

Significance. If the central claims hold, the work offers a principled, stakeholder-derived approach to interpretability that could help bridge performance and transparency gaps in educational AI. Strengths include the explicit grounding in stakeholder analysis, the use of a public benchmark dataset for empirical comparison, and the human featurization alignment experiment, all of which provide concrete, falsifiable anchors for the feasibility argument.

major comments (3)
  1. [Stakeholder Analysis and FGTI Derivation] The derivation of the FGTI principles from stakeholder interviews is presented as foundational, yet the manuscript provides no separate validation (e.g., follow-up surveys, decision-making experiments, or operational deployment tests) to establish that these four axes are necessary and sufficient for the interpretability requirements of raters, administrators, or students under high-stakes conditions; this directly underpins the claim that FGTI captures stakeholder needs.
  2. [AnalyticScore Performance Evaluation] The performance claim that AnalyticScore remains within 0.06 QWK of uninterpretable SOTA rests on direct comparison to published baselines, but the manuscript supplies no implementation details, statistical significance tests, per-item variance, or ablation studies; without these, the feasibility demonstration for the FGTI principles cannot be rigorously assessed.
  3. [Human Featurization Comparison] The human featurization alignment result is limited to the same extraction task on ASAP-SAS items and does not test end-to-end traceability or interchangeability within live scoring workflows; this weakens support for the broader claim that AnalyticScore delivers stakeholder-usable interpretability via the FGTI principles.
minor comments (2)
  1. [Abstract] The abstract states that AnalyticScore 'outperforms many uninterpretable scoring methods' without naming the specific baselines or reporting the quantitative margins; adding these details would improve clarity.
  2. [Throughout] Ensure that all acronyms (e.g., QWK, ASAP-SAS) are defined at first use and used consistently in tables and figure captions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Stakeholder Analysis and FGTI Derivation] The derivation of the FGTI principles from stakeholder interviews is presented as foundational, yet the manuscript provides no separate validation (e.g., follow-up surveys, decision-making experiments, or operational deployment tests) to establish that these four axes are necessary and sufficient for the interpretability requirements of raters, administrators, or students under high-stakes conditions; this directly underpins the claim that FGTI captures stakeholder needs.

    Authors: The FGTI principles were systematically derived from the stakeholder interviews and needs analysis presented in Section 3 of the manuscript. We did not perform additional validation experiments such as follow-up surveys or operational deployment tests within the scope of this work, which instead prioritizes derivation followed by a feasibility demonstration via AnalyticScore. We agree that further validation would strengthen the foundational claims and will add a limitations subsection that explicitly discusses the current scope of the stakeholder analysis while outlining directions for future validation studies. revision: partial

  2. Referee: [AnalyticScore Performance Evaluation] The performance claim that AnalyticScore remains within 0.06 QWK of uninterpretable SOTA rests on direct comparison to published baselines, but the manuscript supplies no implementation details, statistical significance tests, per-item variance, or ablation studies; without these, the feasibility demonstration for the FGTI principles cannot be rigorously assessed.

    Authors: We acknowledge that the current presentation of results would benefit from greater rigor. In the revised manuscript we will add an appendix with full implementation details and hyperparameters, report per-item QWK values with associated variance, include statistical significance testing (e.g., paired tests) against the published baselines, and provide ablation studies that isolate the contribution of individual FGTI components to overall performance. revision: yes

  3. Referee: [Human Featurization Comparison] The human featurization alignment result is limited to the same extraction task on ASAP-SAS items and does not test end-to-end traceability or interchangeability within live scoring workflows; this weakens support for the broader claim that AnalyticScore delivers stakeholder-usable interpretability via the FGTI principles.

    Authors: The human featurization alignment experiment provides targeted evidence for the Groundedness principle by comparing feature extraction behavior on the public ASAP-SAS items. We recognize that this does not extend to end-to-end evaluation inside live operational scoring workflows. The manuscript's stated scope is a feasibility demonstration on the benchmark dataset; we will revise the discussion section to more clearly delineate this scope and to note that full workflow integration testing constitutes an important direction for subsequent research. revision: partial

Circularity Check

0 steps flagged

No circularity: performance and alignment claims rest on external empirical benchmarks

full rationale

The paper first extracts stakeholder needs via interviews to derive the four FGTI principles, then constructs the AnalyticScore framework to implement those principles, and finally reports direct empirical results: AnalyticScore achieves scoring accuracy competitive with uninterpretable SOTA (within 0.06 QWK on average across 10 ASAP-SAS items) and featurization behavior that aligns with human annotators on the same task. These outcomes are measured against published baselines on a public dataset and against independent human raters; they do not reduce by construction to any parameter fitted inside the paper, nor do they rely on a self-citation chain or an ansatz smuggled from prior author work. The derivation chain therefore remains self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the validity of the stakeholder-needs analysis that produced the FGTI principles and on the assumption that the AnalyticScore implementation faithfully realizes those principles; no free parameters or new physical entities are introduced in the abstract.

axioms (1)
  • domain assumption Stakeholder needs analysis yields a complete and actionable set of interpretability requirements for automated scoring
    The paper derives the four FGTI principles directly from this analysis.

pith-pipeline@v0.9.0 · 5520 in / 1338 out tokens · 41620 ms · 2026-05-17T20:59:04.263620+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 4 internal anchors

  1. [1]

    The standards for educational and psychological testing

    AERA , APA , and NCME . The standards for educational and psychological testing. 2014

  2. [2]

    Chain-of-thought reasoning in the wild is not always faithful

    Iv \'a n Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025

  3. [3]

    Take no shortcuts! stick to the rubric: A method for building trustworthy short answer scoring models

    Yuya Asazuma, Hiroaki Funayama, Yuichiroh Matsubayashi, Tomoya Mizumoto, Paul Reisert, and Kentaro Inui. Take no shortcuts! stick to the rubric: A method for building trustworthy short answer scoring models. In International Conference on Higher Education Learning Methodologies and Technologies Online, pages 337--358. Springer, 2023

  4. [4]

    Cognitive foundations of automated scoring

    Malcolm I Bauer and Diego Zapata-Rivera. Cognitive foundations of automated scoring. In Handbook of automated scoring, pages 13--28. Chapman and Hall/CRC, 2020

  5. [5]

    Automated scoring with validity in mind

    Isaac I Bejar, Robert J Mislevy, and Mo Zhang. Automated scoring with validity in mind. The Wiley handbook of cognition and assessment: Frameworks, methodologies, and applications, pages 226--246, 2016

  6. [6]

    Moving the field forward: Some thoughts on validity and automated scoring

    Randy Elliot Bennett. Moving the field forward: Some thoughts on validity and automated scoring. Automated scoring of complex tasks in computer-based testing, pages 403--412, 2006

  7. [7]

    Validity and automad scoring: It's not only the scoring

    Randy Elliot Bennett and Isaac I Bejar. Validity and automad scoring: It's not only the scoring. Educational Measurement: Issues and Practice, 17 0 (4): 0 9--17, 1998

  8. [8]

    Validity and automated scoring

    Randy Elliot Bennett and Mo Zhang. Validity and automated scoring. In Technology and testing, pages 142--173. Routledge, 2015

  9. [9]

    What use is educational assessment?, 2019

    Amy I Berman, Michael J Feuer, and James W Pellegrino. What use is educational assessment?, 2019

  10. [10]

    Explainable machine learning in deployment

    Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, Jos \'e MF Moura, and Peter Eckersley. Explainable machine learning in deployment. In Proceedings of the 2020 conference on fairness, accountability, and transparency, pages 648--657, 2020

  11. [11]

    Assessment and classroom learning

    Paul Black and Dylan Wiliam. Assessment and classroom learning. Assessment in Education: principles, policy & practice, 5 0 (1): 0 7--74, 1998

  12. [12]

    Explainable automatic grading with neural additive models

    Aubrey Condor and Zachary Pardos. Explainable automatic grading with neural additive models. In International Conference on Artificial Intelligence in Education, pages 18--31. Springer, 2024

  13. [13]

    The effects of explanations in automated essay scoring systems on student trust and motivation

    Rianne Conijn, Patricia Kahr, and Chris CP Snijders. The effects of explanations in automated essay scoring systems on student trust and motivation. Journal of Learning Analytics, 10 0 (1): 0 37--53, 2023

  14. [14]

    Qlora: Efficient finetuning of quantized llms

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36: 0 10088--10115, 2023

  15. [15]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171--4186, 2019

  16. [16]

    Assessment design with automated scoring in mind

    Kristen DiCerbo, Emily Lai, and Ventura Matthew. Assessment design with automated scoring in mind. In Handbook of Automated Scoring, pages 29--48. Chapman and Hall/CRC, 2020

  17. [17]

    Validity arguments for ai-based automated scores: Essay scoring as an illustration

    Steve Ferrara and Saed Qunbar. Validity arguments for ai-based automated scores: Essay scoring as an illustration. Journal of Educational Measurement, 59 0 (3): 0 288--313, 2022

  18. [18]

    The past, present, and future of automated scoring

    Peter W Foltz, Duanli Yan, and Andr \'e A Rupp. The past, present, and future of automated scoring. In Handbook of Automated Scoring, pages 1--10. Chapman and Hall/CRC, 2020

  19. [19]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  20. [20]

    Teachers' summative practices and assessment for learning--tensions and synergies

    Wynne Harlen. Teachers' summative practices and assessment for learning--tensions and synergies. Curriculum Journal, 16 0 (2): 0 207--223, 2005

  21. [21]

    DeBERTa: Decoding-enhanced BERT with Disentangled Attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654, 2020

  22. [22]

    Ethics of ai in education: Towards a community-wide framework

    Wayne Holmes, Kaska Porayska-Pomsta, Ken Holstein, Emma Sutherland, Toby Baker, Simon Buckingham Shum, Olga C Santos, Mercedes T Rodrigo, Mutlu Cukurova, Ig Ibert Bittencourt, et al. Ethics of ai in education: Towards a community-wide framework. International Journal of Artificial Intelligence in Education, pages 1--23, 2022

  23. [23]

    Math autoscoring is finally here—let's tap its potential for improving student performance

    Institute of Education Statistics . Math autoscoring is finally here—let's tap its potential for improving student performance. https://ies.ed.gov/learn/blog/math-autoscoring-finally-here-lets-tap-its-potential-improving-student-performance, Oct 2023. [Accessed: Feb 21. 2025]

  24. [24]

    Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable nlp systems: How should we define and evaluate faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198--4205, 2020

  25. [25]

    Explainable artificial intelligence in education

    Hassan Khosravi, Simon Buckingham Shum, Guanliang Chen, Cristina Conati, Yi-Shan Tsai, Judy Kay, Simon Knight, Roberto Martinez-Maldonado, Shazia Sadiq, and Dragan Ga s evi \'c . Explainable artificial intelligence in education. Computers and education: artificial intelligence, 3: 0 100074, 2022

  26. [26]

    Concept bottleneck models

    Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In International conference on machine learning, pages 5338--5348. PMLR, 2020

  27. [27]

    Content analysis: An introduction to its methodology

    Klaus Krippendorff. Content analysis: An introduction to its methodology. Sage publications, 2018

  28. [28]

    Explainable automated essay scoring: Deep learning really has pedagogical value

    Vivekanandan Kumar and David Boulanger. Explainable automated essay scoring: Deep learning really has pedagogical value. In Frontiers in education, volume 5, page 572367. Frontiers Media SA, 2020

  29. [29]

    Automated essay scoring and the deep learning black box: How are rubric scores determined? International Journal of Artificial Intelligence in Education, 31 0 (3): 0 538--584, 2021

    Vivekanandan S Kumar and David Boulanger. Automated essay scoring and the deep learning black box: How are rubric scores determined? International Journal of Artificial Intelligence in Education, 31 0 (3): 0 538--584, 2021

  30. [30]

    Get it scored using autosas—an automated system for scoring short answers

    Yaman Kumar, Swati Aggarwal, Debanjan Mahata, Rajiv Ratn Shah, Ponnurangam Kumaraguru, and Roger Zimmermann. Get it scored using autosas—an automated system for scoring short answers. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 9662--9669, 2019

  31. [31]

    C-rater: Automated scoring of short-answer questions

    Claudia Leacock and Martin Chodorow. C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37 0 (4): 0 389--405, 2003

  32. [32]

    Applying large language models and chain-of-thought for automatic scoring

    Gyeong-Geon Lee, Ehsan Latif, Xuansheng Wu, Ninghao Liu, and Xiaoming Zhai. Applying large language models and chain-of-thought for automatic scoring. Computers and Education: Artificial Intelligence, 6: 0 100213, 2024

  33. [33]

    An automated explainable educational assessment system built on llms

    Jiazheng Li, Artem Bobrov, David West, Cesare Aloisi, and Yulan He. An automated explainable educational assessment system built on llms. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29658--29660, 2025

  34. [34]

    Answer-state recurrent relational network (asrrn) for constructed response assessment and feedback grouping

    Zhaohui Li, Susan Lloyd, Matthew Beckman, and Rebecca J Passonneau. Answer-state recurrent relational network (asrrn) for constructed response assessment and feedback grouping. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3879--3891, 2023

  35. [35]

    The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery

    Zachary C Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16 0 (3): 0 31--57, 2018

  36. [36]

    Fairness, accountability, transparency, and ethics (fate) in artificial intelligence (ai) and higher education: A systematic review

    Bahar Memarian and Tenzin Doleck. Fairness, accountability, transparency, and ethics (fate) in artificial intelligence (ai) and higher education: A systematic review. Computers and Education: Artificial Intelligence, 5: 0 100152, 2023

  37. [37]

    An evidentiary-reasoning perspective on automated scoring: Commentary on part i

    Robert J Mislevy. An evidentiary-reasoning perspective on automated scoring: Commentary on part i. In Handbook of Automated Scoring, pages 151--168. Chapman and Hall/CRC, 2020

  38. [38]

    The pragmatic turn in explainable artificial intelligence (xai)

    Andr \'e s P \'a ez. The pragmatic turn in explainable artificial intelligence (xai). Minds and Machines, 29 0 (3): 0 441--459, 2019

  39. [39]

    On the consistency of ordinal regression methods

    Fabian Pedregosa, Francis Bach, and Alexandre Gramfort. On the consistency of ordinal regression methods. Journal of Machine Learning Research, 18 0 (55): 0 1--35, 2017

  40. [40]

    Pellegrino

    James W. Pellegrino. A Learning Sciences Perspective on the Design and Use of Assessment in Education, page 238–258. Cambridge Handbooks in Psychology. Cambridge University Press, 2022

  41. [41]

    Stakeholders in Explainable AI

    Alun Preece, Dan Harborne, Dave Braines, Richard Tomsett, and Supriyo Chakraborty. Stakeholders in explainable ai. arXiv preprint arXiv:1810.00184, 2018

  42. [42]

    Loss functions for preference levels: Regression with discrete ordered labels

    Jason DM Rennie and Nathan Srebro. Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI multidisciplinary workshop on advances in preference handling, volume 1, pages 1--6. AAAI Press, Menlo Park, CA, 2005

  43. [43]

    Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead

    Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature machine intelligence, 1 0 (5): 0 206--215, 2019

  44. [44]

    Designing, evaluating, and deploying automated scoring systems with validity in mind: Methodological design decisions

    Andr \'e A Rupp. Designing, evaluating, and deploying automated scoring systems with validity in mind: Methodological design decisions. Applied Measurement in Education, 31 0 (3): 0 191--214, 2018

  45. [45]

    Large language models cannot explain themselves

    Advait Sarkar. Large language models cannot explain themselves. arXiv preprint arXiv:2405.04382, 2024

  46. [46]

    Explainability in automatic short answer grading

    Tim Schlippe, Quintus Stierstorfer, Maurice ten Koppel, and Paul Libbrecht. Explainability in automatic short answer grading. In International conference on artificial intelligence in education technology, pages 69--87. Springer, 2022

  47. [47]

    The ABCs of how we learn: 26 scientifically proven approaches, how they work, and when to use them

    Daniel L Schwartz, Jessica M Tsang, and Kristen P Blair. The ABCs of how we learn: 26 scientifically proven approaches, how they work, and when to use them. WW Norton & Company, 2016

  48. [48]

    Contrasting state-of-the-art in the machine scoring of short-form constructed responses

    Mark D Shermis. Contrasting state-of-the-art in the machine scoring of short-form constructed responses. Educational Assessment, 20 0 (1): 0 46--65, 2015

  49. [49]

    Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting

    Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36: 0 74952--74965, 2023

  50. [50]

    Large language models are not fair evaluators

    Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, et al. Large language models are not fair evaluators. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9440--9450, 2024

  51. [51]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

  52. [52]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 0 24824--24837, 2022

  53. [53]

    Results of naep math item automated scoring data challenge & comparison between reading & math challenges

    John Whitmer and Magdalen Beiting-Parrish. Results of naep math item automated scoring data challenge & comparison between reading & math challenges. 2023

  54. [54]

    Lessons learned about transparency, fairness, and explainability from two automated scoring challenges

    John Whitmer and Magdalen Beiting-Parrish. Lessons learned about transparency, fairness, and explainability from two automated scoring challenges. In AI for Education: Bridging Innovation and Responsibility, 2024

  55. [55]

    Embedded formative assessment

    Dylan Wiliam. Embedded formative assessment. Solution tree press, 2011

  56. [56]

    A framework for evaluation and use of automated scoring

    David M Williamson, Xiaoming Xi, and F Jay Breyer. A framework for evaluation and use of automated scoring. Educational measurement: issues and practice, 31 0 (1): 0 2--13, 2012

  57. [57]

    Language in a bottle: Language model guided concept bottlenecks for interpretable image classification

    Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, and Mark Yatskar. Language in a bottle: Language model guided concept bottlenecks for interpretable image classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19187--19197, 2023