UX in the Age of AI: Rethinking Evaluation Metrics Through a Statistical Lens
Pith reviewed 2026-05-08 07:44 UTC · model grok-4.3
The pith
Traditional usability scores like SUS fail for AI products because outputs vary unpredictably, so a new framework models usability as a probability distribution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ADUX-Stat reconceptualises usability as a probabilistic signal distribution rather than a static scalar score by integrating the Interaction Entropy Index to quantify perceived unpredictability of AI responses, the Temporal Drift Coefficient to measure longitudinal changes in usability across interaction sessions, and the Bayesian Usability Confidence Score to generate credible interval estimates under uncertainty. The framework receives conceptual validation against five established AI product categories.
What carries the argument
The ADUX-Stat framework, which replaces fixed usability scores with a statistical distribution model that combines entropy, temporal drift, and Bayesian confidence measures.
If this is right
- Evaluation of AI interfaces can now quantify how much response variability users experience through an entropy index.
- Practitioners can monitor whether perceived usability improves or degrades over repeated interactions using the temporal coefficient.
- Usability reports gain credible intervals that express uncertainty instead of single point estimates.
- The same three-construct model applies across conversational agents, generative interfaces, and recommendation engines.
Where Pith is reading between the lines
- Real-time monitoring dashboards could alert product teams when the temporal drift coefficient signals falling usability in a live AI deployment.
- The entropy and drift ideas might transfer to non-AI domains where outputs also show natural variability, such as adaptive learning systems.
- Empirical trials on production user data would be needed to confirm whether the Bayesian intervals improve decision-making over current practices.
Load-bearing premise
The three new constructs can be operationalised as practical measurements that demonstrably outperform legacy scores on real AI products, rather than remaining conceptual categories.
What would settle it
A study that applies both traditional SUS scores and the full ADUX-Stat set to the same series of user sessions with an AI chatbot or recommender, then checks which set of numbers better predicts independent measures such as task success, session length, or return rate.
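The settling study described above amounts to a predictive-validity comparison: compute each metric per session, then check which correlates better with an independent outcome. A minimal sketch, using entirely invented toy numbers (the `drift_tdc` values are hypothetical, since the paper defines no TDC formula):

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length score series."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys) * len(xs))

# Toy per-session numbers, invented purely for illustration:
sus       = [72, 68, 80, 55, 63, 77]             # traditional scalar score
drift_tdc = [-0.4, -0.1, 0.3, -0.9, -0.5, 0.2]   # hypothetical TDC values
returned  = [1, 1, 1, 0, 0, 1]                   # did the user come back?

print("SUS vs return:", round(pearson(sus, returned), 2))
print("TDC vs return:", round(pearson(drift_tdc, returned), 2))
```

Whichever metric set yields consistently stronger correlations with such outcomes across products would carry the argument; the sketch only shows the shape of the comparison, not its result.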
Original abstract
The rapid proliferation of artificial intelligence (AI) in consumer-facing digital products has disrupted the assumptions underlying classical user experience (UX) evaluation frameworks. Legacy metrics such as the System Usability Scale (SUS), Net Promoter Score (NPS), and task completion rate were engineered for deterministic, rule-based interfaces where identical inputs yield identical outputs. In AI-mediated systems -- spanning conversational agents, generative interfaces, and recommendation engines -- outputs are stochastic, context-sensitive, and temporally variable, rendering these metrics structurally insufficient. This paper introduces the Adaptive Dynamic UX Statistical Framework (ADUX-Stat), a novel evaluation model that reconceptualises usability as a probabilistic signal distribution rather than a static scalar score. ADUX-Stat integrates three original constructs: (1) Interaction Entropy Index (IEI), quantifying the unpredictability of AI responses from a user perception standpoint; (2) Temporal Drift Coefficient (TDC), measuring longitudinal degradation or improvement of perceived usability over interaction sessions; and (3) Bayesian Usability Confidence Score (BUCS), producing credible interval estimates of usability quality under uncertainty. The framework is validated conceptually against five established AI product categories. ADUX-Stat addresses a critical gap at the intersection of HCI research, statistical modelling, and AI product evaluation, offering a reproducible, field-deployable methodology for UX practitioners and researchers alike.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that legacy UX metrics such as SUS, NPS, and task completion rates are structurally inadequate for AI-mediated interfaces because those interfaces produce stochastic, context-sensitive, and temporally variable outputs. It introduces the Adaptive Dynamic UX Statistical Framework (ADUX-Stat), which reconceptualizes usability as a probabilistic signal distribution rather than a static scalar, by integrating three new constructs: the Interaction Entropy Index (IEI) to quantify response unpredictability, the Temporal Drift Coefficient (TDC) to track longitudinal change, and the Bayesian Usability Confidence Score (BUCS) to produce credible-interval estimates. The framework is presented as conceptually validated across five AI product categories and is claimed to supply a reproducible, field-deployable replacement for existing metrics.
Significance. If the three constructs were supplied with explicit probability distributions, entropy or drift estimators, Bayesian priors, and empirical tests on interaction logs showing measurable improvement over SUS/NPS, the work could usefully extend HCI evaluation practice into the stochastic regime of modern AI products. As written, the absence of any such operationalization means the contribution remains at the level of a high-level proposal rather than a demonstrated methodology.
major comments (3)
- [§3] §3 (Framework definition): No probability distribution, entropy formula, drift estimator, or Bayesian prior is supplied for IEI, TDC, or BUCS. This directly undermines the central claim that ADUX-Stat constitutes a 'reproducible, field-deployable methodology,' because the constructs cannot be computed from user data without these definitions.
- [Validation section] Validation section: The 'conceptual validation' consists solely of category-level matching against five AI product types with no dataset, no quantitative scores, and no head-to-head comparison against SUS or NPS baselines. This leaves the assertion that the new framework outperforms legacy scalars unsupported and prevents any test of the probabilistic-signal-distribution reconceptualization.
- [Abstract and §1] Abstract and §1: The statement that ADUX-Stat 'reconceptualises usability as a probabilistic signal distribution' is not accompanied by any model equation showing how IEI, TDC, and BUCS combine into such a distribution or how credible intervals are derived.
minor comments (1)
- [Abstract] The five AI product categories referenced in the abstract and validation are never enumerated, reducing clarity for readers attempting to map the framework to concrete systems.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments, which highlight important gaps in the operationalization of our proposed framework. We agree that the current manuscript presents ADUX-Stat primarily at a conceptual level and will undertake major revisions to supply explicit mathematical definitions, model equations, and empirical comparisons. These changes will strengthen the reproducibility claims while preserving the core contribution of reconceptualizing UX evaluation for stochastic AI systems.
Point-by-point responses
Referee: [§3] §3 (Framework definition): No probability distribution, entropy formula, drift estimator, or Bayesian prior is supplied for IEI, TDC, or BUCS. This directly undermines the central claim that ADUX-Stat constitutes a 'reproducible, field-deployable methodology,' because the constructs cannot be computed from user data without these definitions.
Authors: We acknowledge that Section 3 currently defines the three constructs conceptually without the explicit formulas needed for direct computation. In the revised manuscript we will add: the probability distribution over AI response variability (a mixture model capturing stochasticity), the entropy formula for IEI (an adaptation of Shannon entropy applied to sequences of user-perceived response categories), the drift estimator for TDC (a linear mixed-effects regression coefficient on session-level usability signals), and the Bayesian priors plus likelihood for BUCS (weakly informative priors on usability parameters with credible-interval computation via MCMC). Worked numerical examples using synthetic interaction logs will be included to demonstrate field deployment. revision: yes
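The "adaptation of Shannon entropy applied to sequences of user-perceived response categories" promised for the IEI can be sketched directly. This is a hypothetical reading, since the manuscript supplies no formula; the category labels are invented for illustration:

```python
import math
from collections import Counter

def interaction_entropy(perceived_categories):
    """Shannon entropy (bits) over user-perceived response categories.

    A hypothetical reading of the IEI: applies H = -sum(p * log2 p)
    to the empirical category frequencies within one session.
    """
    counts = Counter(perceived_categories)
    n = len(perceived_categories)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A session where every response feels the same is maximally predictable:
assert interaction_entropy(["on-topic"] * 8) == 0.0

# Four equally frequent perceived categories give log2(4) = 2 bits:
session = ["on-topic", "off-topic", "verbose", "refusal"] * 3
print(interaction_entropy(session))  # 2.0
```

Higher values would indicate that users perceive the system's responses as more unpredictable, which is the quantity the IEI claims to capture.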
Referee: [Validation section] Validation section: The 'conceptual validation' consists solely of category-level matching against five AI product types with no dataset, no quantitative scores, and no head-to-head comparison against SUS or NPS baselines. This leaves the assertion that the new framework outperforms legacy scalars unsupported and prevents any test of the probabilistic-signal-distribution reconceptualization.
Authors: The original validation is indeed limited to conceptual mapping. We agree this does not provide quantitative support for superiority over SUS or NPS. In revision we will add an empirical validation subsection that analyzes anonymized interaction logs from two AI product categories (e.g., a conversational agent and a generative recommendation interface). We will compute IEI, TDC, and BUCS, report descriptive statistics and credible intervals, and perform direct comparisons with SUS and NPS scores collected from the same users, including statistical tests of differential sensitivity to temporal variability. revision: yes
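For the proposed head-to-head comparison, the SUS baseline at least is fully specified in the literature. A sketch of standard SUS scoring, against which the ADUX-Stat quantities would be compared:

```python
def sus_score(responses):
    """Standard SUS scoring: ten 1-5 Likert items.

    Odd-numbered items (positively worded) contribute (r - 1);
    even-numbered items (negatively worded) contribute (5 - r).
    The sum is scaled by 2.5 to give a 0-100 score.
    """
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # i=0 is item 1 (odd-numbered)
        for i, r in enumerate(responses)
    ]
    return sum(contributions) * 2.5

# Strong agreement on positive items, strong disagreement on negative ones:
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0
```

Note that SUS yields one scalar per questionnaire administration, which is exactly the property the referee's test would probe: whether the distributional IEI/TDC/BUCS quantities track temporal variability that this scalar misses.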
Referee: [Abstract and §1] Abstract and §1: The statement that ADUX-Stat 'reconceptualises usability as a probabilistic signal distribution' is not accompanied by any model equation showing how IEI, TDC, and BUCS combine into such a distribution or how credible intervals are derived.
Authors: We will revise both the abstract and Section 1 to include a compact model equation. Usability will be expressed as a hierarchical Bayesian posterior p(U | data) where the signal distribution is a function of IEI (entropy term), TDC (drift term), and BUCS (posterior summary). Credible intervals will be defined as the 95% highest-posterior-density interval of this posterior. A short derivation sketch will be added to make the probabilistic reconceptualization explicit. revision: yes
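The promised 95% highest-posterior-density interval can be sketched on a toy posterior. The Beta(9, 3) density below is illustrative only; the manuscript specifies neither likelihood nor prior, so this shows the mechanics of an HPD computation, not the authors' model:

```python
# Toy sketch: 95% highest-posterior-density (HPD) interval for a
# usability parameter U in [0, 1], via a discretised posterior.
def hpd_interval(density, grid, mass=0.95):
    """Smallest set of grid points holding `mass` of the probability,
    returned as (lo, hi) of the contiguous covering interval."""
    total = sum(density)
    ranked = sorted(range(len(grid)), key=lambda i: density[i], reverse=True)
    kept, acc = [], 0.0
    for i in ranked:           # accumulate highest-density points first
        kept.append(i)
        acc += density[i] / total
        if acc >= mass:
            break
    return grid[min(kept)], grid[max(kept)]

N = 2001
grid = [i / (N - 1) for i in range(N)]
# Unnormalised Beta(9, 3) density, u^8 * (1 - u)^2 -- illustrative only:
density = [u**8 * (1 - u) ** 2 for u in grid]
lo, hi = hpd_interval(density, grid)
print(f"95% HPD for U: [{lo:.2f}, {hi:.2f}]")
```

For a unimodal posterior such as this one, the highest-density points form a contiguous interval, so reporting (min, max) of the kept points is valid; a multimodal posterior would require reporting the disjoint regions separately.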
Circularity Check
No circularity: conceptual framework introduces additive constructs without equations or derivations
Full rationale
The manuscript presents ADUX-Stat as an integrative conceptual model that redefines usability via three newly named constructs (IEI, TDC, BUCS) and offers only conceptual validation across AI product categories. No equations, probability distributions, entropy formulas, drift estimators, Bayesian priors, or computational procedures appear in the text, so no derivation chain exists that could reduce a claimed prediction or result to its own inputs by construction. The framework is explicitly additive rather than derived from legacy metrics or fitted parameters, satisfying the default expectation of no significant circularity for papers whose contribution remains at the level of naming and high-level integration without mathematical reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Legacy UX metrics such as SUS, NPS, and task completion rate were engineered only for deterministic interfaces where identical inputs yield identical outputs.
invented entities (3)
- Interaction Entropy Index (IEI): no independent evidence
- Temporal Drift Coefficient (TDC): no independent evidence
- Bayesian Usability Confidence Score (BUCS): no independent evidence
Reference graph
Works this paper leans on
- [1] J. Nielsen, "Usability Engineering," Academic Press, 1993.
- [2] J. Brooke, "SUS: A quick and dirty usability scale," in Usability Evaluation in Industry, Taylor & Francis, 1996, pp. 189–194.
- [3] ISO 9241-11:2018, "Ergonomics of human-system interaction — Part 11: Usability: Definitions and concepts," ISO, 2018.
- [4] J. Sauro and J. R. Lewis, "Quantifying the User Experience: Practical Statistics for User Research," 2nd ed., Morgan Kaufmann, 2016.
- [5] J. Sauro and J. R. Lewis, "When designing usability questionnaires, does it hurt to be positive?" in Proc. ACM CHI, 2011, pp. 2215–2224.
- [6] T. Tullis and B. Albert, "Measuring the User Experience," 2nd ed., Morgan Kaufmann, 2013.
- [7] S. Amershi et al., "Software engineering for machine learning: A case study," in Proc. IEEE/ACM 41st ICSE, 2019, pp. 291–300.
- [8] C. Cai et al., "The effects of example-based explanations in a machine learning interface," in Proc. ACM IUI, 2019, pp. 258–262.
- [9] A. Folstad and P. B. Brandtzaeg, "Chatbots and the new world of HCI," Interactions, vol. 24, no. 4, pp. 38–42, 2017.
- [10] B. Shneiderman, "Human-Centered AI," Oxford University Press, 2022.
- [11] M. Schmettow, "Sample size in usability studies," Communications of the ACM, vol. 55, no. 4, pp. 64–70, 2012.
- [12] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
- [13] X. Bi and S. Zhai, "Bayesian touch: A statistical criterion of target selection with finger touch," in Proc. ACM UIST, 2013, pp. 51–60.