Introduces a Q-sort protocol using human reference factors to quantify LLM value-structure alignment via Procrustes similarity and RSA correlations, revealing cross-family heterogeneity and localized misalignments.
Evaluating general-purpose AI with psychometrics
5 Pith papers cite this work.
representative citing papers
FairTree audits ML models for subgroup fairness by decomposing performance disparities into systematic bias and variance using permutation-based and fluctuation tests adapted from psychometric methods.
A majorization-minimization framework turns IRT into scalable matrix factorization subproblems for LLM evaluation, delivering orders-of-magnitude speedups with identifiability guarantees.
AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.
STAMP-LLM is a two-phase psychometric protocol for designing and applying bias measures to LLMs, illustrated with one explicit and two implicit racial bias tests.
citing papers explorer
- An Interpretable and Scalable Framework for Evaluating Large Language Models: A majorization-minimization framework turns IRT into scalable matrix factorization subproblems for LLM evaluation, delivering orders-of-magnitude speedups with identifiability guarantees.