UX in the Age of AI: Rethinking Evaluation Metrics Through a Statistical Lens
Pith reviewed 2026-05-08 07:44 UTC · model grok-4.3
The pith
Traditional usability scores like SUS fail for AI products because outputs vary unpredictably, so a new framework models usability as a probability distribution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ADUX-Stat reconceptualises usability as a probabilistic signal distribution rather than a static scalar score by integrating the Interaction Entropy Index to quantify perceived unpredictability of AI responses, the Temporal Drift Coefficient to measure longitudinal changes in usability across interaction sessions, and the Bayesian Usability Confidence Score to generate credible interval estimates under uncertainty. The framework receives conceptual validation against five established AI product categories.
What carries the argument
The ADUX-Stat framework, which replaces fixed usability scores with a statistical distribution model that combines entropy, temporal drift, and Bayesian confidence measures.
If this is right
- Evaluation of AI interfaces can now quantify how much response variability users experience through an entropy index.
- Practitioners can monitor whether perceived usability improves or degrades over repeated interactions using the temporal coefficient.
- Usability reports gain credible intervals that express uncertainty instead of single point estimates.
- The same three-construct model applies across conversational agents, generative interfaces, and recommendation engines.
Where Pith is reading between the lines
- Real-time monitoring dashboards could alert product teams when the temporal drift coefficient signals falling usability in a live AI deployment.
- The entropy and drift ideas might transfer to non-AI domains where outputs also show natural variability, such as adaptive learning systems.
- Empirical trials on production user data would be needed to confirm whether the Bayesian intervals improve decision-making over current practices.
Load-bearing premise
The three new constructs can be operationalised as practical measurements that demonstrably outperform legacy scores on real AI products, rather than remaining conceptual categories.
What would settle it
A study that applies both traditional SUS scores and the full ADUX-Stat set to the same series of user sessions with an AI chatbot or recommender, then checks which set of numbers better predicts independent measures such as task success, session length, or return rate.
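The settling study described above amounts to a predictive-validity comparison: compute each metric per session, then check which correlates better with an independent outcome. A minimal sketch, using entirely invented toy numbers (the `drift_tdc` values are hypothetical, since the paper defines no TDC formula):

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length score series."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys) * len(xs))

# Toy per-session numbers, invented purely for illustration:
sus       = [72, 68, 80, 55, 63, 77]             # traditional scalar score
drift_tdc = [-0.4, -0.1, 0.3, -0.9, -0.5, 0.2]   # hypothetical TDC values
returned  = [1, 1, 1, 0, 0, 1]                   # did the user come back?

print("SUS vs return:", round(pearson(sus, returned), 2))
print("TDC vs return:", round(pearson(drift_tdc, returned), 2))
```

Whichever metric set yields consistently stronger correlations with such outcomes across products would carry the argument; the sketch only shows the shape of the comparison, not its result.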
Original abstract
The rapid proliferation of artificial intelligence (AI) in consumer-facing digital products has disrupted the assumptions underlying classical user experience (UX) evaluation frameworks. Legacy metrics such as the System Usability Scale (SUS), Net Promoter Score (NPS), and task completion rate were engineered for deterministic, rule-based interfaces where identical inputs yield identical outputs. In AI-mediated systems -- spanning conversational agents, generative interfaces, and recommendation engines -- outputs are stochastic, context-sensitive, and temporally variable, rendering these metrics structurally insufficient. This paper introduces the Adaptive Dynamic UX Statistical Framework (ADUX-Stat), a novel evaluation model that reconceptualises usability as a probabilistic signal distribution rather than a static scalar score. ADUX-Stat integrates three original constructs: (1) Interaction Entropy Index (IEI), quantifying the unpredictability of AI responses from a user perception standpoint; (2) Temporal Drift Coefficient (TDC), measuring longitudinal degradation or improvement of perceived usability over interaction sessions; and (3) Bayesian Usability Confidence Score (BUCS), producing credible interval estimates of usability quality under uncertainty. The framework is validated conceptually against five established AI product categories. ADUX-Stat addresses a critical gap at the intersection of HCI research, statistical modelling, and AI product evaluation, offering a reproducible, field-deployable methodology for UX practitioners and researchers alike.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that legacy UX metrics such as SUS, NPS, and task completion rates are structurally inadequate for AI-mediated interfaces because those interfaces produce stochastic, context-sensitive, and temporally variable outputs. It introduces the Adaptive Dynamic UX Statistical Framework (ADUX-Stat), which reconceptualizes usability as a probabilistic signal distribution rather than a static scalar, by integrating three new constructs: the Interaction Entropy Index (IEI) to quantify response unpredictability, the Temporal Drift Coefficient (TDC) to track longitudinal change, and the Bayesian Usability Confidence Score (BUCS) to produce credible-interval estimates. The framework is presented as conceptually validated across five AI product categories and is claimed to supply a reproducible, field-deployable replacement for existing metrics.
Significance. If the three constructs were supplied with explicit probability distributions, entropy or drift estimators, Bayesian priors, and empirical tests on interaction logs showing measurable improvement over SUS/NPS, the work could usefully extend HCI evaluation practice into the stochastic regime of modern AI products. As written, the absence of any such operationalization means the contribution remains at the level of a high-level proposal rather than a demonstrated methodology.
major comments (3)
- [§3] §3 (Framework definition): No probability distribution, entropy formula, drift estimator, or Bayesian prior is supplied for IEI, TDC, or BUCS. This directly undermines the central claim that ADUX-Stat constitutes a 'reproducible, field-deployable methodology,' because the constructs cannot be computed from user data without these definitions.
- [Validation section] Validation section: The 'conceptual validation' consists solely of category-level matching against five AI product types with no dataset, no quantitative scores, and no head-to-head comparison against SUS or NPS baselines. This leaves the assertion that the new framework outperforms legacy scalars unsupported and prevents any test of the probabilistic-signal-distribution reconceptualization.
- [Abstract and §1] Abstract and §1: The statement that ADUX-Stat 'reconceptualises usability as a probabilistic signal distribution' is not accompanied by any model equation showing how IEI, TDC, and BUCS combine into such a distribution or how credible intervals are derived.
minor comments (1)
- [Abstract] The five AI product categories referenced in the abstract and validation are never enumerated, reducing clarity for readers attempting to map the framework to concrete systems.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments, which highlight important gaps in the operationalization of our proposed framework. We agree that the current manuscript presents ADUX-Stat primarily at a conceptual level and will undertake major revisions to supply explicit mathematical definitions, model equations, and empirical comparisons. These changes will strengthen the reproducibility claims while preserving the core contribution of reconceptualizing UX evaluation for stochastic AI systems.
Point-by-point responses
Referee: [§3] §3 (Framework definition): No probability distribution, entropy formula, drift estimator, or Bayesian prior is supplied for IEI, TDC, or BUCS. This directly undermines the central claim that ADUX-Stat constitutes a 'reproducible, field-deployable methodology,' because the constructs cannot be computed from user data without these definitions.
Authors: We acknowledge that Section 3 currently defines the three constructs conceptually without the explicit formulas needed for direct computation. In the revised manuscript we will add: the probability distribution over AI response variability (a mixture model capturing stochasticity), the entropy formula for IEI (an adaptation of Shannon entropy applied to sequences of user-perceived response categories), the drift estimator for TDC (a linear mixed-effects regression coefficient on session-level usability signals), and the Bayesian priors plus likelihood for BUCS (weakly informative priors on usability parameters with credible-interval computation via MCMC). Worked numerical examples using synthetic interaction logs will be included to demonstrate field deployment. revision: yes
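The "adaptation of Shannon entropy applied to sequences of user-perceived response categories" promised for the IEI can be sketched directly. This is a hypothetical reading, since the manuscript supplies no formula; the category labels are invented for illustration:

```python
import math
from collections import Counter

def interaction_entropy(perceived_categories):
    """Shannon entropy (bits) over user-perceived response categories.

    A hypothetical reading of the IEI: applies H = -sum(p * log2 p)
    to the empirical category frequencies within one session.
    """
    counts = Counter(perceived_categories)
    n = len(perceived_categories)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A session where every response feels the same is maximally predictable:
assert interaction_entropy(["on-topic"] * 8) == 0.0

# Four equally frequent perceived categories give log2(4) = 2 bits:
session = ["on-topic", "off-topic", "verbose", "refusal"] * 3
print(interaction_entropy(session))  # 2.0
```

Higher values would indicate that users perceive the system's responses as more unpredictable, which is the quantity the IEI claims to capture.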
Referee: [Validation section] Validation section: The 'conceptual validation' consists solely of category-level matching against five AI product types with no dataset, no quantitative scores, and no head-to-head comparison against SUS or NPS baselines. This leaves the assertion that the new framework outperforms legacy scalars unsupported and prevents any test of the probabilistic-signal-distribution reconceptualization.
Authors: The original validation is indeed limited to conceptual mapping. We agree this does not provide quantitative support for superiority over SUS or NPS. In revision we will add an empirical validation subsection that analyzes anonymized interaction logs from two AI product categories (e.g., a conversational agent and a generative recommendation interface). We will compute IEI, TDC, and BUCS, report descriptive statistics and credible intervals, and perform direct comparisons with SUS and NPS scores collected from the same users, including statistical tests of differential sensitivity to temporal variability. revision: yes
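For the proposed head-to-head comparison, the SUS baseline at least is fully specified in the literature. A sketch of standard SUS scoring, against which the ADUX-Stat quantities would be compared:

```python
def sus_score(responses):
    """Standard SUS scoring: ten 1-5 Likert items.

    Odd-numbered items (positively worded) contribute (r - 1);
    even-numbered items (negatively worded) contribute (5 - r).
    The sum is scaled by 2.5 to give a 0-100 score.
    """
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # i=0 is item 1 (odd-numbered)
        for i, r in enumerate(responses)
    ]
    return sum(contributions) * 2.5

# Strong agreement on positive items, strong disagreement on negative ones:
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0
```

Note that SUS yields one scalar per questionnaire administration, which is exactly the property the referee's test would probe: whether the distributional IEI/TDC/BUCS quantities track temporal variability that this scalar misses.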
Referee: [Abstract and §1] Abstract and §1: The statement that ADUX-Stat 'reconceptualises usability as a probabilistic signal distribution' is not accompanied by any model equation showing how IEI, TDC, and BUCS combine into such a distribution or how credible intervals are derived.
Authors: We will revise both the abstract and Section 1 to include a compact model equation. Usability will be expressed as a hierarchical Bayesian posterior p(U | data) where the signal distribution is a function of IEI (entropy term), TDC (drift term), and BUCS (posterior summary). Credible intervals will be defined as the 95% highest-posterior-density interval of this posterior. A short derivation sketch will be added to make the probabilistic reconceptualization explicit. revision: yes
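The promised 95% highest-posterior-density interval can be sketched on a toy posterior. The Beta(9, 3) density below is illustrative only; the manuscript specifies neither likelihood nor prior, so this shows the mechanics of an HPD computation, not the authors' model:

```python
# Toy sketch: 95% highest-posterior-density (HPD) interval for a
# usability parameter U in [0, 1], via a discretised posterior.
def hpd_interval(density, grid, mass=0.95):
    """Smallest set of grid points holding `mass` of the probability,
    returned as (lo, hi) of the contiguous covering interval."""
    total = sum(density)
    ranked = sorted(range(len(grid)), key=lambda i: density[i], reverse=True)
    kept, acc = [], 0.0
    for i in ranked:           # accumulate highest-density points first
        kept.append(i)
        acc += density[i] / total
        if acc >= mass:
            break
    return grid[min(kept)], grid[max(kept)]

N = 2001
grid = [i / (N - 1) for i in range(N)]
# Unnormalised Beta(9, 3) density, u^8 * (1 - u)^2 -- illustrative only:
density = [u**8 * (1 - u) ** 2 for u in grid]
lo, hi = hpd_interval(density, grid)
print(f"95% HPD for U: [{lo:.2f}, {hi:.2f}]")
```

For a unimodal posterior such as this one, the highest-density points form a contiguous interval, so reporting (min, max) of the kept points is valid; a multimodal posterior would require reporting the disjoint regions separately.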
Circularity Check
No circularity: conceptual framework introduces additive constructs without equations or derivations
Full rationale
The manuscript presents ADUX-Stat as an integrative conceptual model that redefines usability via three newly named constructs (IEI, TDC, BUCS) and offers only conceptual validation across AI product categories. No equations, probability distributions, entropy formulas, drift estimators, Bayesian priors, or computational procedures appear in the text, so no derivation chain exists that could reduce a claimed prediction or result to its own inputs by construction. The framework is explicitly additive rather than derived from legacy metrics or fitted parameters, satisfying the default expectation of no significant circularity for papers whose contribution remains at the level of naming and high-level integration without mathematical reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Legacy UX metrics such as SUS, NPS, and task completion rate were engineered only for deterministic interfaces where identical inputs yield identical outputs.
invented entities (3)
- Interaction Entropy Index (IEI): no independent evidence
- Temporal Drift Coefficient (TDC): no independent evidence
- Bayesian Usability Confidence Score (BUCS): no independent evidence
Reference graph
Works this paper leans on
- [1] J. Nielsen, "Usability Engineering," Academic Press, 1993.
- [2] J. Brooke, "SUS: A quick and dirty usability scale," in Usability Evaluation in Industry, Taylor & Francis, 1996, pp. 189–194.
- [3] ISO 9241-11:2018, "Ergonomics of human-system interaction — Part 11: Usability: Definitions and concepts," ISO, 2018.
- [4] J. Sauro and J. R. Lewis, "Quantifying the User Experience: Practical Statistics for User Research," 2nd ed., Morgan Kaufmann, 2016.
- [5] J. Sauro and J. R. Lewis, "When designing usability questionnaires, does it hurt to be positive?" in Proc. ACM CHI, 2011, pp. 2215–2224.
- [6] T. Tullis and B. Albert, "Measuring the User Experience," 2nd ed., Morgan Kaufmann, 2013.
- [7] S. Amershi et al., "Software engineering for machine learning: A case study," in Proc. IEEE/ACM 41st ICSE, 2019, pp. 291–300.
- [8] C. Cai et al., "The effects of example-based explanations in a machine learning interface," in Proc. ACM IUI, 2019, pp. 258–262.
- [9] A. Folstad and P. B. Brandtzaeg, "Chatbots and the new world of HCI," Interactions, vol. 24, no. 4, pp. 38–42, 2017.
- [10] B. Shneiderman, "Human-Centered AI," Oxford University Press, 2022.
- [11] M. Schmettow, "Sample size in usability studies," Communications of the ACM, vol. 55, no. 4, pp. 64–70, 2012.
- [12] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
- [13] X. Bi and S. Zhai, "Bayesian touch: A statistical criterion of target selection with finger touch," in Proc. ACM UIST, 2013, pp. 51–60.