Evaluating General-Purpose AI with Psychometrics

Xiting Wang, Liming Jiang, Jose Hernandez-Orallo, David Stillwell, Luning Sun, Fang Luo, Xing Xie · 2023 · arXiv 2310.16379

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Beyond Value Benchmarks: Measuring Value-Structure Alignment in Large Language Models via Symmetric Q-Sorts

cs.CL · 2026-06-20 · unverdicted · novelty 7.0

Introduces a Q-sort protocol using human reference factors to quantify LLM value-structure alignment via Procrustes similarity and RSA correlations, revealing cross-family heterogeneity and localized misalignments.

FairTree: Subgroup Fairness Auditing of Machine Learning Models with Bias-Variance Decomposition

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

FairTree audits ML models for subgroup fairness by decomposing performance disparities into systematic bias and variance using permutation-based and fluctuation tests adapted from psychometric methods.

An Interpretable and Scalable Framework for Evaluating Large Language Models

stat.ML · 2026-05-07 · unverdicted · novelty 6.0

A majorization-minimization framework turns IRT into scalable matrix factorization subproblems for LLM evaluation, delivering orders-of-magnitude speedups with identifiability guarantees.

Position: AI Evaluations Should be Grounded on a Theory of Capability

cs.AI · 2025-09-23 · conditional · novelty 5.0

AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.

Designing Psychometric Bias Measures for ChatBots: An Application to Racial Bias Measurement

cs.HC · 2025-08-17 · unverdicted · novelty 5.0

STAMP-LLM is a two-phase psychometric protocol for designing and applying bias measures to LLMs, illustrated with one explicit and two implicit racial bias tests.

citing papers explorer

An Interpretable and Scalable Framework for Evaluating Large Language Models

stat.ML · 2026-05-07 · unverdicted · none · ref 54

A majorization-minimization framework turns IRT into scalable matrix factorization subproblems for LLM evaluation, delivering orders-of-magnitude speedups with identifiability guarantees.
