Recognition: no theorem link
User Reviews as a Source for Usability Requirements: A Precursor Study on Using Large Language Models
Pith reviewed 2026-05-14 20:20 UTC · model grok-4.3
The pith
Large language models can identify usability requirements in user reviews with F-scores comparable to human raters when the prompt is well designed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs are generally able to recognize usability as a non-functional requirement in user reviews, as measured by F-score, but their performance and reliability depend strongly on the prompt. The study supplies a fully coded dataset of 300 reviews labeled by two human raters and an LLM, together with an initial prompt derived from two prompt-engineering iterations and coding guidelines based on Nielsen's 10 Usability Heuristics.
What carries the argument
An iteratively refined prompt, built from Nielsen's 10 Usability Heuristics, that directs the LLM to filter user reviews for usability-relevant content.
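As a concrete illustration of that carrier, here is a minimal sketch of a heuristics-grounded filtering prompt. It is not the paper's actual prompt, and `call_llm` is a hypothetical placeholder for whichever chat-completion client is used; only the shape of the task is shown.

```python
# Illustrative sketch only: NOT the paper's prompt; `call_llm` is a hypothetical
# stand-in for whatever chat-completion client is actually used.

NIELSEN_HEURISTICS = [
    "Visibility of system status",
    "Match between system and the real world",
    "User control and freedom",
    "Consistency and standards",
    "Error prevention",
    "Recognition rather than recall",
    "Flexibility and efficiency of use",
    "Aesthetic and minimalist design",
    "Help users recognize, diagnose, and recover from errors",
    "Help and documentation",
]

def build_prompt(review: str) -> str:
    # Anchor the decision in the 10 heuristics, mirroring the role the
    # coding guidelines play for the human raters.
    heuristics = "\n".join(f"- {h}" for h in NIELSEN_HEURISTICS)
    return (
        "You are screening app-store reviews for a requirements engineer.\n"
        "A review is usability-relevant if it touches on any of Nielsen's 10 "
        "usability heuristics:\n"
        f"{heuristics}\n\n"
        f'Review: "{review}"\n'
        "Answer with exactly one word: YES if the review is usability-relevant, otherwise NO."
    )

def is_usability_relevant(review: str, call_llm) -> bool:
    # call_llm: hypothetical callable (prompt: str) -> str
    return call_llm(build_prompt(review)).strip().upper().startswith("YES")
```

The paper's two engineering iterations refine exactly this kind of instruction text; the review-level YES/NO output is what gets scored against the human labels.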
If this is right
- Development teams can process large volumes of user reviews quickly and at low cost to surface usability requirements.
- LLMs provide an alternative to training dedicated machine-learning classifiers for requirements classification tasks.
- The approach supports user-centered requirements elicitation by leveraging existing review data rather than new manual labeling.
- Prompt refinement becomes a central engineering activity whose outcome directly affects the quality of extracted requirements.
Where Pith is reading between the lines
- The same prompt strategy could be adapted to extract other non-functional requirements such as security or performance concerns from reviews.
- Testing the prompt on a stream of live app-store reviews would reveal how well it handles new phrasing and emerging issues.
- Embedding the classification step inside requirements-management tools could reduce the manual triage burden on product teams.
Load-bearing premise
Human raters supply consistent ground-truth labels for usability aspects, and the prompt developed on this dataset will produce reliable results on new reviews and with different LLMs.
What would settle it
Apply the final prompt to a fresh set of reviews independently labeled by new human raters and check whether the resulting F-score stays within the same range as the original human-to-human agreement.
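A minimal sketch of that check, assuming binary labels (1 = usability-relevant) from the new raters and from the LLM on the fresh reviews; variable names are illustrative.

```python
# Sketch of the settling test: compare LLM-vs-human F1 against human-vs-human F1.

def f1_score(gold: list[int], pred: list[int]) -> float:
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Hypothetical label lists for the fresh review set:
# llm_vs_human   = f1_score(new_rater_a, llm_labels)
# human_vs_human = f1_score(new_rater_a, new_rater_b)
# The claim survives if llm_vs_human stays within the human_vs_human range.
```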
Original abstract
It is known that user-centered approaches to requirements engineering in general lead to a better suited product for the end-users. LLM4RE provides promising approaches to support the requirements elicitation process (e.g. classification of requirements). Previous approaches focus on Machine-Learning (ML) or Deep-Learning (DL) aspects, which require intensive training with a large amount of manually labeled data. LLMs, on the other hand, are pre-trained on large amounts of user-generated text data, enabling a user-centric workflow to analyze requirements. In this paper, we explore the possibility of exploiting the improved natural language understanding of LLMs, rather than strict ML classification, together with the mass extraction of user reviews to analyze if the performance of LLMs in understanding user reviews is comparable to the performance of human raters. This enables a quick and cheap workflow for development teams to gather and process their users' requirements. This paper provides three major contributions: (1) We provide a completely coded dataset of 300 user reviews containing usability-relevant aspects from three different types of apps, that were labeled by two human raters and by an LLM. (2) We build an initial prompt, based on two prompt engineering iterations and specifically developed coding guidelines derived from the 10 Nielsen Usability Heuristics, for LLMs to filter usability relevant user reviews. (3) We determine that LLMs are generally able to recognize usability as a non-functional requirement in user reviews, in terms of their F-score, but the performance and reliability is strongly dependent on the prompt.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper explores using large language models (LLMs) to identify usability-related content in user reviews as a precursor to requirements engineering. It contributes a labeled dataset of 300 reviews from three app types (labeled by two human raters and an LLM), develops an initial prompt via two iterations based on Nielsen's 10 usability heuristics, and claims that LLMs can generally recognize usability as a non-functional requirement with F-scores comparable to human raters, though performance and reliability depend strongly on the prompt.
Significance. If the central claims hold after proper validation, the work could support low-cost, scalable extraction of usability requirements from abundant user reviews without needing large manually labeled training sets typical of ML/DL approaches. The provision of a fully coded dataset is a clear strength for reproducibility and community use.
Major comments (3)
- Abstract: The abstract reports F-score comparisons on 300 reviews but provides no numerical F-score values, no inter-rater agreement statistics, and no details on the prompt iterations or exact LLM used, which prevents verification of the claim that LLMs perform comparably to humans.
- Methodology / Labeling Process: The prompt was developed and refined iteratively on the exact same 300 reviews used for evaluation, with no held-out test set, cross-validation, or external validation described; this raises a direct risk of overfitting and weakens the conclusion that performance 'is strongly dependent on the prompt' in a generalizable way.
- Labeling Process: Only two human raters are used with no reported inter-rater reliability metric (e.g., Cohen's kappa or percentage agreement), so the reliability of the ground-truth labels is unknown and the F-score comparison to the LLM rests on an unverified foundation.
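To make the overfitting concern in the second major comment concrete, here is a hedged sketch of the held-out evaluation it asks for: the prompt is iterated only on a development split, and the reported F-score comes from reviews the prompt never saw. The split function, fraction, and seed are illustrative assumptions, not values from the paper.

```python
# Hedged sketch: refine the prompt on `dev` only, report the final F-score on
# `held_out` only. The 50/50 split and the seed are arbitrary illustrative choices.
import random

def split_reviews(reviews: list[str], dev_fraction: float = 0.5, seed: int = 0):
    shuffled = reviews[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]  # (development set, held-out test set)
```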
Minor comments (1)
- Abstract: Consider adding the actual F-score numbers and a brief note on the LLM model/version to make the central result immediately assessable.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our precursor study. We address each major comment below and will revise the manuscript to improve clarity, transparency, and acknowledgment of limitations where appropriate.
Point-by-point responses
Referee: Abstract: The abstract reports F-score comparisons on 300 reviews but provides no numerical F-score values, no inter-rater agreement statistics, and no details on the prompt iterations or exact LLM used, which prevents verification of the claim that LLMs perform comparably to humans.
Authors: We agree that the abstract should be more informative. In the revised manuscript, we will include the specific F-score values for the LLM (which were comparable to those of the human raters), the inter-rater agreement statistic, the exact LLM model employed, and a concise description of the two prompt engineering iterations. This will enable readers to directly assess the comparability claim. revision: yes
Referee: Methodology / Labeling Process: The prompt was developed and refined iteratively on the exact same 300 reviews used for evaluation, with no held-out test set, cross-validation, or external validation described; this raises a direct risk of overfitting and weakens the conclusion that performance 'is strongly dependent on the prompt' in a generalizable way.
Authors: This observation is correct and highlights a limitation inherent to our small-scale precursor study. With only 300 reviews available, iterative prompt refinement was conducted on the full set, which is a common practice in early-stage prompt engineering but does carry overfitting risk. We will revise the methodology and discussion sections to explicitly state this limitation, note the absence of a held-out set, and qualify the generalizability of the prompt-dependence conclusion. We will also stress that the publicly released labeled dataset allows other researchers to perform independent validation on new data. revision: partial
Referee: Labeling Process: Only two human raters are used with no reported inter-rater reliability metric (e.g., Cohen's kappa or percentage agreement), so the reliability of the ground-truth labels is unknown and the F-score comparison to the LLM rests on an unverified foundation.
Authors: We accept this criticism. The revised manuscript will include the inter-rater reliability metric (Cohen's kappa and percentage agreement) computed between the two human raters. This addition will provide a clearer basis for interpreting the LLM's F-score performance relative to human labeling. revision: yes
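For reference, the two statistics promised in this response reduce to a few lines. This is a generic sketch for two raters assigning binary usability labels to the same 300 reviews, not the authors' analysis code.

```python
# Generic inter-rater reliability sketch for binary labels (1 = usability-relevant).

def percent_agreement(a: list[int], b: list[int]) -> float:
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a: list[int], b: list[int]) -> float:
    n = len(a)
    p_o = percent_agreement(a, b)             # observed agreement
    pa, pb = sum(a) / n, sum(b) / n           # each rater's "usability" rate
    p_e = pa * pb + (1 - pa) * (1 - pb)       # agreement expected by chance
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
```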
Circularity Check
No circularity: direct empirical F-score comparison on labeled reviews
Full rationale
The paper is a precursor empirical study that labels 300 user reviews with two human raters using Nielsen-derived guidelines, develops a prompt through two iterations, and computes F-scores for LLM classification against those human labels. No equations, derivations, fitted parameters, or predictions appear that reduce to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. The central claim rests on observable performance metrics rather than any self-referential reduction, making the work self-contained.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Nielsen's 10 Usability Heuristics provide a valid framework for identifying usability-relevant content in user reviews.
Reference graph
Works this paper leans on
- [1] E. Bakiu, E. Guzman, Which feature is unusable? Detecting usability and user experience issues from user reviews, in: 2017 IEEE 25th International Requirements Engineering Conference Workshops (REW), 2017, pp. 182–187. doi:10.1109/REW.2017.76
- [2] E. Groen, Crowd-Based Requirements Engineering, Doctoral thesis, Universiteit Utrecht, 2025. doi:10.33540/3091
- [3] L. Zhao, W. Alhoshan, A. Ferrari, K. J. Letsholo, M. A. Ajagbe, E.-V. Chioasca, R. T. Batista-Navarro, Natural language processing for requirements engineering: A systematic mapping study, ACM Comput. Surv. 54 (2021). URL: https://doi.org/10.1145/3444689. doi:10.1145/3444689
- [4]
- [5] M. Unterbusch, M. Sadeghi, J. Fischbach, M. Obaidi, A. Vogelsang, Explanation needs in app reviews: Taxonomy and automated detection, 2023, pp. 102–111. doi:10.1109/REW57809.2023.00024
- [6] F. Wei, R. Keeling, N. Huber-Fliflet, J. Zhang, A. Dabrowski, J. Yang, Q. Mao, H. Qin, Empirical study of LLM fine-tuning for text classification in legal document review, in: 2023 IEEE International Conference on Big Data (BigData), 2023, pp. 2786–2792. doi:10.1109/BigData59044.2023.10386911
- [7] J. Dąbrowski, W. Cai, A. Bennaceur, B. Nuseibeh, F. Alrimawi, Intelligent agents for requirements engineering: Use, feasibility and evaluation, in: 2025 IEEE 33rd International Requirements Engineering Conference (RE), 2025, pp. 535–543. doi:10.1109/RE63999.2025.00064
- [8] ISO, ISO 9241-110:2020 Ergonomics of human-system interaction — Part 110: Interaction principles, International Standards Organisation, 2020
- [9] N. Bevan, J. Carter, S. Harker, ISO 9241-11 revised: What have we learnt about usability since 1998?, in: M. Kurosu (Ed.), Human-Computer Interaction: Design and Evaluation, Springer International Publishing, Cham, 2015, pp. 143–151
- [10] D. Quiñones, C. Rusu, How to develop usability heuristics: A systematic literature review, Computer Standards & Interfaces 53 (2017) 89–122. URL: https://www.sciencedirect.com/science/article/pii/S0920548917301058. doi:10.1016/j.csi.2017.03.009
- [11] J. Nielsen, 10 usability heuristics for user interface design, 1994. URL: https://www.nngroup.com/articles/ten-usability-heuristics/, last accessed: 01/12/2026
- [12] J. Nielsen, Enhancing the explanatory power of usability heuristics, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1994, pp. 152–158
- [13] J. Nielsen, Enhancing the explanatory power of usability heuristics, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’94, Association for Computing Machinery, New York, NY, USA, 1994, pp. 152–158. URL: https://doi.org/10.1145/191666.191729. doi:10.1145/191666.191729
- [14] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, D. C. Schmidt, A prompt pattern catalog to enhance prompt engineering with ChatGPT, 2023. URL: https://arxiv.org/abs/2302.11382. arXiv:2302.11382
- [15] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, ACM Comput. Surv. 55 (2023). URL: https://doi.org/10.1145/3560815. doi:10.1145/3560815
- [16] P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, A. Chadha, A systematic survey of prompt engineering in large language models: Techniques and applications, 2025. URL: https://arxiv.org/abs/2402.07927. arXiv:2402.07927
- [17] The prompt report: A systematic survey of prompt engineering techniques, 2025. URL: https://arxiv.org/abs/2406.06608. arXiv:2406.06608
- [18] J. Wei, X. Wang, D. Schuurmans, M. Bosma, b. ichter, F. Xia, E. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 24824–24837. URL: https://...
- [19] J. Nielsen, Usability Engineering, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1994
- [20] S. Hedegaard, J. G. Simonsen, Extracting usability and user experience information from online user reviews, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’13, Association for Computing Machinery, New York, NY, USA, 2013, pp. 2089–2098. URL: https://doi.org/10.1145/2470654.2481286. doi:10.1145/2470654.2481286
- [21] A. Forward, T. C. Lethbridge, A taxonomy of software types to facilitate search and evidence-based software engineering, in: Proceedings of the 2008 Conference of the Center for Advanced Studies on Collaborative Research: Meeting of Minds, CASCON ’08, Association for Computing Machinery, New York, NY, USA, 2008. URL: https://doi.org/10.1145/1463788.146380...
- [22] C. Wellhausen, Supplementary material to user reviews as a source for usability requirements, 2026. URL: https://figshare.com/collections/Supplementary_Material_to_User_Reviews_as_a_Source_for_Usability_Requirements/8256262/2. doi:10.6084/m9.figshare.c.8256262.v2
- [23] J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20 (1960) 37–46. doi:10.1177/001316446002000104
- [24] J. R. Landis, G. G. Koch, The measurement of observer agreement for categorical data, Biometrics 33 (1977) 159–174. doi:10.2307/2529310
- [25] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, A. Wesslén, et al., Experimentation in software engineering, volume 236, Springer, 2012
- [26] OpenAI, Introducing GPT-4.1 in the API, 2025. URL: https://openai.com/index/gpt-4-1/, last accessed: 02/17/2026