arXiv: 2511.17069 · v3 · submitted 2025-11-21 · cs.CL

Interpretability from the Ground Up: Stakeholder-Centric Design of Automated Scoring in Educational Assessments

Yunsung Kim, Mike Hardy, Joseph Tey, Candace Thille, Chris Piech

Pith reviewed 2026-05-17 20:59 UTC · model grok-4.3

keywords: automated scoring · interpretability · educational assessment · constructed response · stakeholder analysis · FGTI principles · AnalyticScore · human alignment

The pith

Stakeholder analysis yields four principles that guide an automated scoring system to near state-of-the-art accuracy while remaining explainable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper begins by mapping the distinct transparency needs of students, teachers, administrators, and test developers in large-scale educational assessments. From those needs it derives four guiding principles for interpretability: faithfulness to the scoring criteria, groundedness in observable response evidence, traceability of each decision step, and interchangeability with human reasoning processes. It then constructs AnalyticScore as a working implementation of those principles for text-based constructed-response items. On ten items drawn from the ASAP-SAS dataset, the resulting system outperforms many uninterpretable scoring methods and lands, on average, within 0.06 quadratic weighted kappa (QWK) of the strongest uninterpretable models. Its featurization behavior also aligns well with that of human annotators performing the same task.

Core claim

AnalyticScore applies the FGTI principles through explicit feature extraction and traceable scoring steps to produce grades for student open-ended text responses. Across the ten ASAP-SAS items it outperforms many uninterpretable scoring methods in accuracy while remaining within 0.06 QWK, on average, of the current uninterpretable state of the art. Its feature selections align well with those of human raters given the same featurization instructions.
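Because the headline comparison is stated in QWK, it helps to see how the metric is computed. The following is a minimal sketch of the standard quadratic weighted kappa definition, not code from the paper; the function name and argument names are assumptions chosen for illustration. With K score categories, the disagreement weight for scores i and j is (i - j)^2 / (K - 1)^2, and QWK is one minus the ratio of observed to chance-expected weighted disagreement.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Standard QWK between two equal-length lists of integer scores.

    Illustrative helper (hypothetical name), not the paper's implementation.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n_categories = max_score - min_score + 1
    if n_categories < 2:
        raise ValueError("need at least two score categories")

    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts O[i, j]
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / (n_categories - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)

    if weighted_expected == 0.0:
        # Both raters gave the same single score to every response;
        # kappa is undefined here, so return 1.0 by convention.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 means perfect agreement and 0 means chance-level agreement, so a gap of 0.06 between two systems is a difference in agreement with the human-assigned scores, not a difference in raw accuracy.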

What carries the argument

AnalyticScore, a reference framework that encodes the four FGTI principles through human-aligned, explicitly extracted features and step-wise traceable scoring.

If this is right

Automated scoring systems can satisfy transparency demands from multiple stakeholder groups without large accuracy penalties.
Feature choices that match human raters increase the chance that educators will understand and accept the scores.
The same four principles can serve as a design template for interpretable AI tools in other parts of education.
Traceable outputs let stakeholders identify exactly which parts of a response drove a given score.
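To make the idea of a traceable output concrete, here is a toy sketch of a two-step scorer: one subroutine labels which analytic components a response contains, and a second maps those labels to a score while returning the full trace. Everything in it is an assumption for illustration: the class and function names, the keyword matcher (a crude stand-in for whatever labeler a real system uses), the weights, and the cutoffs. It is not AnalyticScore's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalyticComponent:
    name: str          # human-readable statement a response may contain
    keywords: tuple    # hypothetical stand-in for a learned or LLM-based labeler
    weight: float      # hypothetical contribution of this component to the score


def featurize(response, components):
    """Step 1: label which components are present in the response."""
    text = response.lower()
    return {c.name: any(k in text for k in c.keywords) for c in components}


def score_with_trace(response, components, cutoffs):
    """Step 2: map component labels to a score and keep every intermediate."""
    features = featurize(response, components)
    weighted_sum = sum(c.weight for c in components if features[c.name])
    # The score is the number of cutoffs the weighted sum reaches.
    score = sum(weighted_sum >= cut for cut in cutoffs)
    return {"features": features, "weighted_sum": weighted_sum, "score": score}


# Hypothetical components for the panda/koala item shown in Figure 1.
COMPONENTS = (
    AnalyticComponent("pandas eat bamboo", ("bamboo",), 1.0),
    AnalyticComponent("koalas eat eucalyptus", ("eucalyptus",), 1.0),
)

trace = score_with_trace(
    "Pandas eat bamboo and koalas eat eucalyptus.", COMPONENTS, cutoffs=(1.0, 2.0)
)
```

Each entry in the returned trace corresponds to one inspectable step, so a reader can see which labeled components produced the final score and could, in principle, redo either step by hand.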

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the FGTI principles prove robust, the same stakeholder-first method could be applied to other high-stakes AI decisions such as admissions or hiring tools.
Real classroom pilots would be required to check whether the measured human alignment produces measurable gains in trust and fairness perceptions.
Extending the framework to handle spoken responses or multimodal submissions would test how well the principles travel beyond text.

Load-bearing premise

The four FGTI principles derived from stakeholder interviews are sufficient to satisfy the real interpretability requirements of every group that uses large-scale assessments.

What would settle it

A deployment study in which actual teachers or administrators report that, even with AnalyticScore outputs in hand, they cannot explain or contest individual student scores as readily as they can with current human-scored rubrics.

Figures

Figures reproduced from arXiv: 2511.17069 by Candace Thille, Chris Piech, Joseph Tey, Mike Hardy, Yunsung Kim.

**Figure 1.** Schematic of the ANALYTICSCORE framework. The example question is: “Explain how pandas in China and koalas in Australia are similar, and how they both are different from pythons.” Principle 3 (Traceable). The scoring model should consist of subroutines that each represent a specific, well-defined evidentiary reasoning step on clearly specified inputs. Principle 4 (Interchangeable). A human should be able t…

**Figure 2.** The exact prompts used to implement the feature labeling function from Section 3.2. Students were asked the following question: ```[QUESTION PROMPT]``` Here are several examples of student responses to the question: Student Response: [RESPONSE TEXT] x 1000 Please tell me 15 short, simple, and representative statement, claims, or arguments that are common across many student responses and that disting…

**Figure 3.** Welcome page of one of the 3 Qualtrics forms used for the featurization alignment

**Figure 4.** Example description of the assessment item and assessment instruction shown to the

**Figure 5.** Each annotation task presented annotators with a (response, analytic component) pair and

Original abstract

AI-driven automated scoring systems offer scalable and efficient means of evaluating complex student-generated responses. Yet, despite increasing demand for transparency and interpretability, the field has yet to develop a widely accepted solution for interpretable automated scoring to be used in large-scale real-world assessments. This work takes a principled approach to address this challenge. We analyze the needs and potential benefits of interpretable automated scoring for various assessment stakeholder groups and develop four principles of interpretability -- (F)aithfulness, (G)roundedness, (T)raceability, and (I)nterchangeability (FGTI) -- targeted at those needs. To illustrate the feasibility of implementing these principles, we develop the AnalyticScore framework as a reference framework. When applied to the domain of text-based constructed-response scoring, AnalyticScore outperforms many uninterpretable scoring methods in terms of scoring accuracy and is, on average, within 0.06 QWK of the uninterpretable SOTA across 10 items from the ASAP-SAS dataset. By comparing against human annotators conducting the same featurization task, we further demonstrate that the featurization behavior of AnalyticScore aligns well with that of humans.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FGTI principles from stakeholder input plus AnalyticScore give a workable reference for interpretable scoring, but the human featurization match on ASAP-SAS is too narrow to confirm real usability.

The punchline is that this paper starts from stakeholder interviews to define four interpretability principles for educational scoring (Faithfulness, Groundedness, Traceability, and Interchangeability) and then shows a reference framework that stays within 0.06 QWK of the top black-box models on ten ASAP-SAS items while matching human feature choices on the same task. That combination is the main thing to know.

What is actually new is the explicit derivation of those FGTI principles from the needs of raters, teachers, administrators, and students rather than from model internals alone. The AnalyticScore implementation then serves as a concrete example that can be compared directly to existing methods. The paper does well at reporting the performance numbers on a public dataset and at including the human annotator comparison for the featurization step, which gives the groundedness claim some empirical footing instead of leaving it as an assertion.

The soft spots are proportionate to the evidence presented. The human alignment result is limited to feature extraction on the same items and does not test whether the outputs would actually support decision-making in live, high-stakes scoring. There is also no separate validation that the four principles cover what stakeholders require under operational conditions, and the abstract gives no implementation details, ablations, or statistical tests that would let a reader judge robustness. This leaves the central claim that the approach delivers stakeholder-usable interpretability only moderately supported so far.

This paper is for researchers and practitioners working on AI in education and large-scale assessment design. A reader who wants concrete ideas for moving beyond accuracy-only systems would find the framework and the principle list useful as a starting point.

It deserves a serious referee because the problem is practical, the dataset comparison is straightforward, and the stakeholder angle is a reasonable direction even with the current gaps in validation. I would recommend sending it out for peer review rather than a desk reject.

Referee Report

3 major / 2 minor

Summary. The paper claims to address the lack of widely accepted interpretable automated scoring for large-scale educational assessments by first analyzing stakeholder needs and deriving four principles of interpretability—Faithfulness, Groundedness, Traceability, and Interchangeability (FGTI)—then illustrating their feasibility via the AnalyticScore reference framework. When applied to text-based constructed-response scoring, AnalyticScore is reported to outperform many uninterpretable methods and to stay within 0.06 QWK of uninterpretable SOTA on average across 10 ASAP-SAS items, while its featurization behavior aligns with that of human annotators on the same task.

Significance. If the central claims hold, the work offers a principled, stakeholder-derived approach to interpretability that could help bridge performance and transparency gaps in educational AI. Strengths include the explicit grounding in stakeholder analysis, the use of a public benchmark dataset for empirical comparison, and the human featurization alignment experiment, all of which provide concrete, falsifiable anchors for the feasibility argument.

major comments (3)

[Stakeholder Analysis and FGTI Derivation] The derivation of the FGTI principles from stakeholder interviews is presented as foundational, yet the manuscript provides no separate validation (e.g., follow-up surveys, decision-making experiments, or operational deployment tests) to establish that these four axes are necessary and sufficient for the interpretability requirements of raters, administrators, or students under high-stakes conditions; this directly underpins the claim that FGTI captures stakeholder needs.
[AnalyticScore Performance Evaluation] The performance claim that AnalyticScore remains within 0.06 QWK of uninterpretable SOTA rests on direct comparison to published baselines, but the manuscript supplies no implementation details, statistical significance tests, per-item variance, or ablation studies; without these, the feasibility demonstration for the FGTI principles cannot be rigorously assessed.
[Human Featurization Comparison] The human featurization alignment result is limited to the same extraction task on ASAP-SAS items and does not test end-to-end traceability or interchangeability within live scoring workflows; this weakens support for the broader claim that AnalyticScore delivers stakeholder-usable interpretability via the FGTI principles.

minor comments (2)

[Abstract] The abstract states that AnalyticScore 'outperforms many uninterpretable scoring methods' without naming the specific baselines or reporting the quantitative margins; adding these details would improve clarity.
[Throughout] Ensure that all acronyms (e.g., QWK, ASAP-SAS) are defined at first use and used consistently in tables and figure captions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Referee: [Stakeholder Analysis and FGTI Derivation] The derivation of the FGTI principles from stakeholder interviews is presented as foundational, yet the manuscript provides no separate validation (e.g., follow-up surveys, decision-making experiments, or operational deployment tests) to establish that these four axes are necessary and sufficient for the interpretability requirements of raters, administrators, or students under high-stakes conditions; this directly underpins the claim that FGTI captures stakeholder needs.

Authors: The FGTI principles were systematically derived from the stakeholder interviews and needs analysis presented in Section 3 of the manuscript. We did not perform additional validation experiments such as follow-up surveys or operational deployment tests within the scope of this work, which instead prioritizes derivation followed by a feasibility demonstration via AnalyticScore. We agree that further validation would strengthen the foundational claims and will add a limitations subsection that explicitly discusses the current scope of the stakeholder analysis while outlining directions for future validation studies.
Revision: partial

Referee: [AnalyticScore Performance Evaluation] The performance claim that AnalyticScore remains within 0.06 QWK of uninterpretable SOTA rests on direct comparison to published baselines, but the manuscript supplies no implementation details, statistical significance tests, per-item variance, or ablation studies; without these, the feasibility demonstration for the FGTI principles cannot be rigorously assessed.

Authors: We acknowledge that the current presentation of results would benefit from greater rigor. In the revised manuscript we will add an appendix with full implementation details and hyperparameters, report per-item QWK values with associated variance, include statistical significance testing (e.g., paired tests) against the published baselines, and provide ablation studies that isolate the contribution of individual FGTI components to overall performance.
Revision: yes

Referee: [Human Featurization Comparison] The human featurization alignment result is limited to the same extraction task on ASAP-SAS items and does not test end-to-end traceability or interchangeability within live scoring workflows; this weakens support for the broader claim that AnalyticScore delivers stakeholder-usable interpretability via the FGTI principles.

Authors: The human featurization alignment experiment provides targeted evidence for the Groundedness principle by comparing feature extraction behavior on the public ASAP-SAS items. We recognize that this does not extend to end-to-end evaluation inside live operational scoring workflows. The manuscript's stated scope is a feasibility demonstration on the benchmark dataset; we will revise the discussion section to more clearly delineate this scope and to note that full workflow integration testing constitutes an important direction for subsequent research.
Revision: partial
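
The second response mentions paired tests on per-item QWK without naming one. As one possible reading, the sketch below runs an exact paired sign-flip (permutation) test on per-item differences between two systems. The choice of test and the function name are assumptions, not something the manuscript or the rebuttal specifies. With ten items, all 2^10 = 1024 sign patterns can be enumerated directly.

```python
from itertools import product


def paired_sign_flip_test(per_item_a, per_item_b):
    """Exact two-sided sign-flip test on paired per-item metric values.

    Returns (mean_difference, p_value). Illustrative helper with a
    hypothetical name; practical only for a small number of items.
    """
    if len(per_item_a) != len(per_item_b) or not per_item_a:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [a - b for a, b in zip(per_item_a, per_item_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    observed = abs(mean_diff)

    at_least_as_extreme = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely
    # to have either sign, so enumerate every sign assignment.
    for signs in product((1, -1), repeat=n):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        if stat >= observed - 1e-12:  # small tolerance for float round-off
            at_least_as_extreme += 1
    return mean_diff, at_least_as_extreme / total
```

With only ten items the smallest attainable two-sided p-value is 2/1024, so reporting the per-item differences alongside the p-value is more informative than the p-value alone.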

Circularity Check

0 steps flagged

No circularity: performance and alignment claims rest on external empirical benchmarks

The paper first extracts stakeholder needs via interviews to derive the four FGTI principles, then constructs the AnalyticScore framework to implement those principles, and finally reports direct empirical results: AnalyticScore achieves scoring accuracy competitive with uninterpretable SOTA (within 0.06 QWK on average across 10 ASAP-SAS items) and featurization behavior that aligns with human annotators on the same task. These outcomes are measured against published baselines on a public dataset and against independent human raters; they do not reduce by construction to any parameter fitted inside the paper, nor do they rely on a self-citation chain or an ansatz smuggled from prior author work. The derivation chain therefore remains self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the validity of the stakeholder-needs analysis that produced the FGTI principles and on the assumption that the AnalyticScore implementation faithfully realizes those principles; no free parameters or new physical entities are introduced in the abstract.

axioms (1)

domain assumption: Stakeholder needs analysis yields a complete and actionable set of interpretability requirements for automated scoring.
The paper derives the four FGTI principles directly from this analysis.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We develop four foundational interpretability principles – Faithful, Grounded, Traceable, and Interchangeable (FGTI) – targeting the needs and benefits of large-scale assessment stakeholders"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
