Recognition: 1 theorem link
· Lean TheoremInterpretability from the Ground Up: Stakeholder-Centric Design of Automated Scoring in Educational Assessments
Pith reviewed 2026-05-17 20:59 UTC · model grok-4.3
The pith
Stakeholder analysis yields four principles that guide an automated scoring system to near state-of-the-art accuracy while remaining explainable.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AnalyticScore applies the FGTI principles through explicit feature extraction and traceable scoring steps to produce grades for student open-ended text responses. Across the ten ASAP-SAS items it surpasses many prior interpretable methods in accuracy while remaining within 0.06 QWK, on average, of the current uninterpretable state of the art. Its feature selections align closely with those of human raters given the same featurization instructions.
What carries the argument
AnalyticScore, a reference framework that encodes the four FGTI principles through human-aligned, explicitly extracted features and step-wise traceable scoring.
If this is right
- Automated scoring systems can satisfy transparency demands from multiple stakeholder groups without large accuracy penalties.
- Feature choices that match human raters increase the chance that educators will understand and accept the scores.
- The same four principles can serve as a design template for interpretable AI tools in other parts of education.
- Traceable outputs let stakeholders identify exactly which parts of a response drove a given score.
Where Pith is reading between the lines
- If the FGTI principles prove robust, the same stakeholder-first method could be applied to other high-stakes AI decisions such as admissions or hiring tools.
- Real classroom pilots would be required to check whether the measured human alignment produces measurable gains in trust and fairness perceptions.
- Extending the framework to handle spoken responses or multimodal submissions would test how well the principles travel beyond text.
Load-bearing premise
The four FGTI principles derived from stakeholder interviews are sufficient to satisfy the real interpretability requirements of every group that uses large-scale assessments.
What would settle it
A deployment study in which actual teachers or administrators report that AnalyticScore outputs still prevent them from explaining or contesting individual student scores as readily as they can with current human-scored rubrics.
Figures
read the original abstract
AI-driven automated scoring systems offer scalable and efficient means of evaluating complex student-generated responses. Yet, despite increasing demand for transparency and interpretability, the field has yet to develop a widely accepted solution for interpretable automated scoring to be used in large-scale real-world assessments. This work takes a principled approach to address this challenge. We analyze the needs and potential benefits of interpretable automated scoring for various assessment stakeholder groups and develop four principles of interpretability -- (F)aithfulness, (G)roundedness, (T)raceability, and (I)nterchangeability (FGTI) -- targeted at those needs. To illustrate the feasibility of implementing these principles, we develop the AnalyticScore framework as a reference framework. When applied to the domain of text-based constructed-response scoring, AnalyticScore outperforms many uninterpretable scoring methods in terms of scoring accuracy and is, on average, within 0.06 QWK of the uninterpretable SOTA across 10 items from the ASAP-SAS dataset. By comparing against human annotators conducting the same featurization task, we further demonstrate that the featurization behavior of AnalyticScore aligns well with that of humans.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to address the lack of widely accepted interpretable automated scoring for large-scale educational assessments by first analyzing stakeholder needs and deriving four principles of interpretability—Faithfulness, Groundedness, Traceability, and Interchangeability (FGTI)—then illustrating their feasibility via the AnalyticScore reference framework. When applied to text-based constructed-response scoring, AnalyticScore is reported to outperform many uninterpretable methods and to stay within 0.06 QWK of uninterpretable SOTA on average across 10 ASAP-SAS items, while its featurization behavior aligns with that of human annotators on the same task.
Significance. If the central claims hold, the work offers a principled, stakeholder-derived approach to interpretability that could help bridge performance and transparency gaps in educational AI. Strengths include the explicit grounding in stakeholder analysis, the use of a public benchmark dataset for empirical comparison, and the human featurization alignment experiment, all of which provide concrete, falsifiable anchors for the feasibility argument.
major comments (3)
- [Stakeholder Analysis and FGTI Derivation] The derivation of the FGTI principles from stakeholder interviews is presented as foundational, yet the manuscript provides no separate validation (e.g., follow-up surveys, decision-making experiments, or operational deployment tests) to establish that these four axes are necessary and sufficient for the interpretability requirements of raters, administrators, or students under high-stakes conditions; this directly underpins the claim that FGTI captures stakeholder needs.
- [AnalyticScore Performance Evaluation] The performance claim that AnalyticScore remains within 0.06 QWK of uninterpretable SOTA rests on direct comparison to published baselines, but the manuscript supplies no implementation details, statistical significance tests, per-item variance, or ablation studies; without these, the feasibility demonstration for the FGTI principles cannot be rigorously assessed.
- [Human Featurization Comparison] The human featurization alignment result is limited to the same extraction task on ASAP-SAS items and does not test end-to-end traceability or interchangeability within live scoring workflows; this weakens support for the broader claim that AnalyticScore delivers stakeholder-usable interpretability via the FGTI principles.
minor comments (2)
- [Abstract] The abstract states that AnalyticScore 'outperforms many uninterpretable scoring methods' without naming the specific baselines or reporting the quantitative margins; adding these details would improve clarity.
- [Throughout] Ensure that all acronyms (e.g., QWK, ASAP-SAS) are defined at first use and used consistently in tables and figure captions.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: [Stakeholder Analysis and FGTI Derivation] The derivation of the FGTI principles from stakeholder interviews is presented as foundational, yet the manuscript provides no separate validation (e.g., follow-up surveys, decision-making experiments, or operational deployment tests) to establish that these four axes are necessary and sufficient for the interpretability requirements of raters, administrators, or students under high-stakes conditions; this directly underpins the claim that FGTI captures stakeholder needs.
Authors: The FGTI principles were systematically derived from the stakeholder interviews and needs analysis presented in Section 3 of the manuscript. We did not perform additional validation experiments such as follow-up surveys or operational deployment tests within the scope of this work, which instead prioritizes derivation followed by a feasibility demonstration via AnalyticScore. We agree that further validation would strengthen the foundational claims and will add a limitations subsection that explicitly discusses the current scope of the stakeholder analysis while outlining directions for future validation studies. revision: partial
-
Referee: [AnalyticScore Performance Evaluation] The performance claim that AnalyticScore remains within 0.06 QWK of uninterpretable SOTA rests on direct comparison to published baselines, but the manuscript supplies no implementation details, statistical significance tests, per-item variance, or ablation studies; without these, the feasibility demonstration for the FGTI principles cannot be rigorously assessed.
Authors: We acknowledge that the current presentation of results would benefit from greater rigor. In the revised manuscript we will add an appendix with full implementation details and hyperparameters, report per-item QWK values with associated variance, include statistical significance testing (e.g., paired tests) against the published baselines, and provide ablation studies that isolate the contribution of individual FGTI components to overall performance. revision: yes
-
Referee: [Human Featurization Comparison] The human featurization alignment result is limited to the same extraction task on ASAP-SAS items and does not test end-to-end traceability or interchangeability within live scoring workflows; this weakens support for the broader claim that AnalyticScore delivers stakeholder-usable interpretability via the FGTI principles.
Authors: The human featurization alignment experiment provides targeted evidence for the Groundedness principle by comparing feature extraction behavior on the public ASAP-SAS items. We recognize that this does not extend to end-to-end evaluation inside live operational scoring workflows. The manuscript's stated scope is a feasibility demonstration on the benchmark dataset; we will revise the discussion section to more clearly delineate this scope and to note that full workflow integration testing constitutes an important direction for subsequent research. revision: partial
Circularity Check
No circularity: performance and alignment claims rest on external empirical benchmarks
full rationale
The paper first extracts stakeholder needs via interviews to derive the four FGTI principles, then constructs the AnalyticScore framework to implement those principles, and finally reports direct empirical results: AnalyticScore achieves scoring accuracy competitive with uninterpretable SOTA (within 0.06 QWK on average across 10 ASAP-SAS items) and featurization behavior that aligns with human annotators on the same task. These outcomes are measured against published baselines on a public dataset and against independent human raters; they do not reduce by construction to any parameter fitted inside the paper, nor do they rely on a self-citation chain or an ansatz smuggled from prior author work. The derivation chain therefore remains self-contained against external benchmarks rather than tautological.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Stakeholder needs analysis yields a complete and actionable set of interpretability requirements for automated scoring
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We develop four foundational interpretability principles – Faithful, Grounded, Traceable, and Interchangeable (FGTI) – targeting the needs and benefits of large-scale assessment stakeholders
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
The standards for educational and psychological testing
AERA , APA , and NCME . The standards for educational and psychological testing. 2014
work page 2014
-
[2]
Chain-of-thought reasoning in the wild is not always faithful
Iv \'a n Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025
-
[3]
Yuya Asazuma, Hiroaki Funayama, Yuichiroh Matsubayashi, Tomoya Mizumoto, Paul Reisert, and Kentaro Inui. Take no shortcuts! stick to the rubric: A method for building trustworthy short answer scoring models. In International Conference on Higher Education Learning Methodologies and Technologies Online, pages 337--358. Springer, 2023
work page 2023
-
[4]
Cognitive foundations of automated scoring
Malcolm I Bauer and Diego Zapata-Rivera. Cognitive foundations of automated scoring. In Handbook of automated scoring, pages 13--28. Chapman and Hall/CRC, 2020
work page 2020
-
[5]
Automated scoring with validity in mind
Isaac I Bejar, Robert J Mislevy, and Mo Zhang. Automated scoring with validity in mind. The Wiley handbook of cognition and assessment: Frameworks, methodologies, and applications, pages 226--246, 2016
work page 2016
-
[6]
Moving the field forward: Some thoughts on validity and automated scoring
Randy Elliot Bennett. Moving the field forward: Some thoughts on validity and automated scoring. Automated scoring of complex tasks in computer-based testing, pages 403--412, 2006
work page 2006
-
[7]
Validity and automad scoring: It's not only the scoring
Randy Elliot Bennett and Isaac I Bejar. Validity and automad scoring: It's not only the scoring. Educational Measurement: Issues and Practice, 17 0 (4): 0 9--17, 1998
work page 1998
-
[8]
Validity and automated scoring
Randy Elliot Bennett and Mo Zhang. Validity and automated scoring. In Technology and testing, pages 142--173. Routledge, 2015
work page 2015
-
[9]
What use is educational assessment?, 2019
Amy I Berman, Michael J Feuer, and James W Pellegrino. What use is educational assessment?, 2019
work page 2019
-
[10]
Explainable machine learning in deployment
Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, Jos \'e MF Moura, and Peter Eckersley. Explainable machine learning in deployment. In Proceedings of the 2020 conference on fairness, accountability, and transparency, pages 648--657, 2020
work page 2020
-
[11]
Assessment and classroom learning
Paul Black and Dylan Wiliam. Assessment and classroom learning. Assessment in Education: principles, policy & practice, 5 0 (1): 0 7--74, 1998
work page 1998
-
[12]
Explainable automatic grading with neural additive models
Aubrey Condor and Zachary Pardos. Explainable automatic grading with neural additive models. In International Conference on Artificial Intelligence in Education, pages 18--31. Springer, 2024
work page 2024
-
[13]
The effects of explanations in automated essay scoring systems on student trust and motivation
Rianne Conijn, Patricia Kahr, and Chris CP Snijders. The effects of explanations in automated essay scoring systems on student trust and motivation. Journal of Learning Analytics, 10 0 (1): 0 37--53, 2023
work page 2023
-
[14]
Qlora: Efficient finetuning of quantized llms
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36: 0 10088--10115, 2023
work page 2023
-
[15]
Bert: Pre-training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171--4186, 2019
work page 2019
-
[16]
Assessment design with automated scoring in mind
Kristen DiCerbo, Emily Lai, and Ventura Matthew. Assessment design with automated scoring in mind. In Handbook of Automated Scoring, pages 29--48. Chapman and Hall/CRC, 2020
work page 2020
-
[17]
Validity arguments for ai-based automated scores: Essay scoring as an illustration
Steve Ferrara and Saed Qunbar. Validity arguments for ai-based automated scores: Essay scoring as an illustration. Journal of Educational Measurement, 59 0 (3): 0 288--313, 2022
work page 2022
-
[18]
The past, present, and future of automated scoring
Peter W Foltz, Duanli Yan, and Andr \'e A Rupp. The past, present, and future of automated scoring. In Handbook of Automated Scoring, pages 1--10. Chapman and Hall/CRC, 2020
work page 2020
-
[19]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Teachers' summative practices and assessment for learning--tensions and synergies
Wynne Harlen. Teachers' summative practices and assessment for learning--tensions and synergies. Curriculum Journal, 16 0 (2): 0 207--223, 2005
work page 2005
-
[21]
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2006
-
[22]
Ethics of ai in education: Towards a community-wide framework
Wayne Holmes, Kaska Porayska-Pomsta, Ken Holstein, Emma Sutherland, Toby Baker, Simon Buckingham Shum, Olga C Santos, Mercedes T Rodrigo, Mutlu Cukurova, Ig Ibert Bittencourt, et al. Ethics of ai in education: Towards a community-wide framework. International Journal of Artificial Intelligence in Education, pages 1--23, 2022
work page 2022
-
[23]
Math autoscoring is finally here—let's tap its potential for improving student performance
Institute of Education Statistics . Math autoscoring is finally here—let's tap its potential for improving student performance. https://ies.ed.gov/learn/blog/math-autoscoring-finally-here-lets-tap-its-potential-improving-student-performance, Oct 2023. [Accessed: Feb 21. 2025]
work page 2023
-
[24]
Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable nlp systems: How should we define and evaluate faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198--4205, 2020
work page 2020
-
[25]
Explainable artificial intelligence in education
Hassan Khosravi, Simon Buckingham Shum, Guanliang Chen, Cristina Conati, Yi-Shan Tsai, Judy Kay, Simon Knight, Roberto Martinez-Maldonado, Shazia Sadiq, and Dragan Ga s evi \'c . Explainable artificial intelligence in education. Computers and education: artificial intelligence, 3: 0 100074, 2022
work page 2022
-
[26]
Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In International conference on machine learning, pages 5338--5348. PMLR, 2020
work page 2020
-
[27]
Content analysis: An introduction to its methodology
Klaus Krippendorff. Content analysis: An introduction to its methodology. Sage publications, 2018
work page 2018
-
[28]
Explainable automated essay scoring: Deep learning really has pedagogical value
Vivekanandan Kumar and David Boulanger. Explainable automated essay scoring: Deep learning really has pedagogical value. In Frontiers in education, volume 5, page 572367. Frontiers Media SA, 2020
work page 2020
-
[29]
Vivekanandan S Kumar and David Boulanger. Automated essay scoring and the deep learning black box: How are rubric scores determined? International Journal of Artificial Intelligence in Education, 31 0 (3): 0 538--584, 2021
work page 2021
-
[30]
Get it scored using autosas—an automated system for scoring short answers
Yaman Kumar, Swati Aggarwal, Debanjan Mahata, Rajiv Ratn Shah, Ponnurangam Kumaraguru, and Roger Zimmermann. Get it scored using autosas—an automated system for scoring short answers. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 9662--9669, 2019
work page 2019
-
[31]
C-rater: Automated scoring of short-answer questions
Claudia Leacock and Martin Chodorow. C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37 0 (4): 0 389--405, 2003
work page 2003
-
[32]
Applying large language models and chain-of-thought for automatic scoring
Gyeong-Geon Lee, Ehsan Latif, Xuansheng Wu, Ninghao Liu, and Xiaoming Zhai. Applying large language models and chain-of-thought for automatic scoring. Computers and Education: Artificial Intelligence, 6: 0 100213, 2024
work page 2024
-
[33]
An automated explainable educational assessment system built on llms
Jiazheng Li, Artem Bobrov, David West, Cesare Aloisi, and Yulan He. An automated explainable educational assessment system built on llms. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29658--29660, 2025
work page 2025
-
[34]
Zhaohui Li, Susan Lloyd, Matthew Beckman, and Rebecca J Passonneau. Answer-state recurrent relational network (asrrn) for constructed response assessment and feedback grouping. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3879--3891, 2023
work page 2023
-
[35]
Zachary C Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16 0 (3): 0 31--57, 2018
work page 2018
-
[36]
Bahar Memarian and Tenzin Doleck. Fairness, accountability, transparency, and ethics (fate) in artificial intelligence (ai) and higher education: A systematic review. Computers and Education: Artificial Intelligence, 5: 0 100152, 2023
work page 2023
-
[37]
An evidentiary-reasoning perspective on automated scoring: Commentary on part i
Robert J Mislevy. An evidentiary-reasoning perspective on automated scoring: Commentary on part i. In Handbook of Automated Scoring, pages 151--168. Chapman and Hall/CRC, 2020
work page 2020
-
[38]
The pragmatic turn in explainable artificial intelligence (xai)
Andr \'e s P \'a ez. The pragmatic turn in explainable artificial intelligence (xai). Minds and Machines, 29 0 (3): 0 441--459, 2019
work page 2019
-
[39]
On the consistency of ordinal regression methods
Fabian Pedregosa, Francis Bach, and Alexandre Gramfort. On the consistency of ordinal regression methods. Journal of Machine Learning Research, 18 0 (55): 0 1--35, 2017
work page 2017
-
[40]
James W. Pellegrino. A Learning Sciences Perspective on the Design and Use of Assessment in Education, page 238–258. Cambridge Handbooks in Psychology. Cambridge University Press, 2022
work page 2022
-
[41]
Stakeholders in Explainable AI
Alun Preece, Dan Harborne, Dave Braines, Richard Tomsett, and Supriyo Chakraborty. Stakeholders in explainable ai. arXiv preprint arXiv:1810.00184, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[42]
Loss functions for preference levels: Regression with discrete ordered labels
Jason DM Rennie and Nathan Srebro. Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI multidisciplinary workshop on advances in preference handling, volume 1, pages 1--6. AAAI Press, Menlo Park, CA, 2005
work page 2005
-
[43]
Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature machine intelligence, 1 0 (5): 0 206--215, 2019
work page 2019
-
[44]
Andr \'e A Rupp. Designing, evaluating, and deploying automated scoring systems with validity in mind: Methodological design decisions. Applied Measurement in Education, 31 0 (3): 0 191--214, 2018
work page 2018
-
[45]
Large language models cannot explain themselves
Advait Sarkar. Large language models cannot explain themselves. arXiv preprint arXiv:2405.04382, 2024
-
[46]
Explainability in automatic short answer grading
Tim Schlippe, Quintus Stierstorfer, Maurice ten Koppel, and Paul Libbrecht. Explainability in automatic short answer grading. In International conference on artificial intelligence in education technology, pages 69--87. Springer, 2022
work page 2022
-
[47]
The ABCs of how we learn: 26 scientifically proven approaches, how they work, and when to use them
Daniel L Schwartz, Jessica M Tsang, and Kristen P Blair. The ABCs of how we learn: 26 scientifically proven approaches, how they work, and when to use them. WW Norton & Company, 2016
work page 2016
-
[48]
Contrasting state-of-the-art in the machine scoring of short-form constructed responses
Mark D Shermis. Contrasting state-of-the-art in the machine scoring of short-form constructed responses. Educational Assessment, 20 0 (1): 0 46--65, 2015
work page 2015
-
[49]
Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36: 0 74952--74965, 2023
work page 2023
-
[50]
Large language models are not fair evaluators
Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, et al. Large language models are not fair evaluators. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9440--9450, 2024
work page 2024
-
[51]
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[52]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 0 24824--24837, 2022
work page 2022
-
[53]
John Whitmer and Magdalen Beiting-Parrish. Results of naep math item automated scoring data challenge & comparison between reading & math challenges. 2023
work page 2023
-
[54]
John Whitmer and Magdalen Beiting-Parrish. Lessons learned about transparency, fairness, and explainability from two automated scoring challenges. In AI for Education: Bridging Innovation and Responsibility, 2024
work page 2024
-
[55]
Dylan Wiliam. Embedded formative assessment. Solution tree press, 2011
work page 2011
-
[56]
A framework for evaluation and use of automated scoring
David M Williamson, Xiaoming Xi, and F Jay Breyer. A framework for evaluation and use of automated scoring. Educational measurement: issues and practice, 31 0 (1): 0 2--13, 2012
work page 2012
-
[57]
Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, and Mark Yatskar. Language in a bottle: Language model guided concept bottlenecks for interpretable image classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19187--19197, 2023
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.