pith. machine review for the scientific record. sign in

arxiv: 2604.05782 · v1 · submitted 2026-04-07 · 💻 cs.SE

Recognition: no theorem link

An Empirical Study of Perceptions of General LLMs and Multimodal LLMs on Hugging Face

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:07 UTC · model grok-4.3

classification 💻 cs.SE
keywords user perceptionsLLM discussionsHugging Faceempirical analysismultimodal LLMsmodel deploymentgeneration qualityaccess barriers
0
0 comments X

The pith

Hugging Face discussions show that access barriers, generation quality, and deployment complexity are the biggest user concerns for both general and multimodal LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines real user discussions on Hugging Face to understand what people actually worry about when using general LLMs and multimodal LLMs. By looking at 662 threads from 38 models, the authors build a taxonomy of concerns and find that getting access to the models, the quality of the outputs they produce, and the difficulty of setting them up and running them stand out as top issues. They also note problems with poor documentation and limited computing resources. A sympathetic reader would care because these insights can guide how model creators and platforms improve the experience for everyday users rather than relying on artificial surveys or bug reports.

Core claim

The paper claims that through manual annotation of user threads on Hugging Face, a three-level taxonomy reveals LLM access barriers, generation quality, and deployment and invocation complexity as the most prominent concerns for both GLLMs and MLLMs, with documentation limitations and resource constraints also notable, leading to actionable implications for the LLM ecosystem.

What carries the argument

A three-level taxonomy developed to systematically characterize user concerns from discussion threads on Hugging Face.

If this is right

  • Model providers should prioritize simplifying access methods and reducing deployment hurdles.
  • Efforts to enhance generation quality could address a major source of user dissatisfaction.
  • Improved documentation would help mitigate common user frustrations.
  • Resource optimization is needed to make advanced models more accessible.
  • The findings apply similarly to both general and multimodal models, suggesting broad applicability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These concerns might indicate that current platforms need better onboarding tools and tutorials.
  • Future studies could compare these issues across other platforms to see if they are universal.
  • Addressing access barriers could accelerate adoption of multimodal capabilities in practical applications.
  • Developers might benefit from focusing on lightweight versions or cloud-based invocation options.

Load-bearing premise

The assumption that the 38 selected models and 662 discussion threads provide a representative sample of user experiences with general and multimodal LLMs.

What would settle it

A replication study that analyzes a larger or differently sampled set of models and threads and finds a different ranking of top concerns, such as security or ethical issues rising above the current top three.

Figures

Figures reproduced from arXiv: 2604.05782 by Jacky Keung, Xiaoxue Ma, Xiao Yu, Xing Hu, Xin Xia, Yujian Liu.

Figure 1
Figure 1. Figure 1: The overview of the research methodology. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
read the original abstract

Large language models (LLMs) have rapidly evolved from general-purpose systems to multimodal models capable of processing text, images, and audio. As both general-purpose LLMs (GLLMs) and multimodal LLMs (MLLMs) gain widespread adoption, understanding user perceptions in real-world settings becomes increasingly important. However, existing studies often rely on surveys or platform-specific data (e.g., Reddit or GitHub issues), which either constrain user feedback through predefined questions or overemphasize failure-driven, debugging-oriented discussions, thus failing to capture diverse, experience-driven, and cross-model user perspectives in practice. To address this issue, we conduct an empirical study of user discussions on Hugging Face, a major model hub with diverse models and active communities. We collect and manually annotate 662 discussion threads from 38 representative models (21 GLLMs and 17 MLLMs), and develop a three-level taxonomy to systematically characterize user concerns. Our analysis reveals that LLM access barriers, generation quality, and deployment and invocation complexity are the most prominent concerns, alongside issues such as documentation limitations and resource constraints. Based on these findings, we derive actionable implications for improving LLM ecosystem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts an empirical study of user perceptions of general LLMs (GLLMs) and multimodal LLMs (MLLMs) by collecting and manually annotating 662 discussion threads from 38 representative models (21 GLLMs and 17 MLLMs) on the Hugging Face platform. It develops a three-level taxonomy to categorize concerns and reports that access barriers, generation quality, and deployment/invocation complexity are the most prominent issues, alongside documentation limitations and resource constraints, from which actionable implications for the LLM ecosystem are derived.

Significance. If the sample and annotations prove robust, the work provides a useful contribution by capturing experience-driven user feedback from a major model hub rather than relying on constrained surveys or failure-focused forums like GitHub issues. The creation of a three-level taxonomy offers a reusable framework for future analyses of LLM adoption barriers. This empirical grounding on real discussions is a strength that could inform model developers and platform maintainers about practical deployment challenges.

major comments (2)
  1. [Abstract / Data Collection] Abstract and Data Collection section: The 38 models are labeled 'representative' and 662 threads are collected without any stated selection criteria (e.g., popularity thresholds such as download counts, architectural diversity, parameter ranges, or release dates) or thread sampling method (e.g., all threads vs. filtered by date or relevance, and per-model distribution). This is load-bearing for the central claim identifying 'most prominent' concerns via frequency analysis, as the reported patterns could reflect selection bias rather than general prevalence.
  2. [Annotation / Taxonomy] Annotation and Taxonomy section: The manual annotation process for the 662 threads and development of the three-level taxonomy provides no details on the number of annotators, inter-rater reliability metrics (e.g., Cohen's kappa or percentage agreement), or resolution of disagreements. This omission directly affects the reliability of the taxonomy and the frequency-based prominence claims that form the paper's core findings.
minor comments (2)
  1. [Taxonomy Description] The taxonomy presentation would be clearer with one or two concrete example threads per top-level category to illustrate how concerns were classified.
  2. [Results] Some figures or tables summarizing the distribution of concerns across GLLMs vs. MLLMs could be added or clarified to strengthen the comparative analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have helped us identify areas where the manuscript can be strengthened. We address each major comment point by point below. We have revised the manuscript to incorporate additional details on data collection and annotation processes, thereby improving transparency and addressing concerns about potential bias and reliability.

read point-by-point responses
  1. Referee: [Abstract / Data Collection] Abstract and Data Collection section: The 38 models are labeled 'representative' and 662 threads are collected without any stated selection criteria (e.g., popularity thresholds such as download counts, architectural diversity, parameter ranges, or release dates) or thread sampling method (e.g., all threads vs. filtered by date or relevance, and per-model distribution). This is load-bearing for the central claim identifying 'most prominent' concerns via frequency analysis, as the reported patterns could reflect selection bias rather than general prevalence.

    Authors: We agree that greater transparency in model and thread selection is necessary to support the frequency-based claims. The original manuscript described the models as 'representative' but did not enumerate the criteria. In the revised manuscript, we have added a dedicated subsection under Data Collection that specifies the selection process: the 38 models (21 GLLMs and 17 MLLMs) were chosen to ensure diversity across (1) popularity, measured by download counts on Hugging Face (top models in their categories), (2) architectural variety (e.g., decoder-only transformers, encoder-decoder, and multimodal architectures), (3) parameter scale (1B to 70B+), and (4) release recency (primarily 2023–2024). For threads, we collected every available discussion thread for each model up to the cutoff date without relevance filtering, yielding the 662 threads; a new table reports the per-model thread counts to allow readers to assess distribution. These additions directly mitigate selection-bias concerns while preserving the empirical grounding of the study. revision: yes

  2. Referee: [Annotation / Taxonomy] Annotation and Taxonomy section: The manual annotation process for the 662 threads and development of the three-level taxonomy provides no details on the number of annotators, inter-rater reliability metrics (e.g., Cohen's kappa or percentage agreement), or resolution of disagreements. This omission directly affects the reliability of the taxonomy and the frequency-based prominence claims that form the paper's core findings.

    Authors: We acknowledge that the absence of these methodological details limits the ability to evaluate annotation reliability. The revised manuscript now expands the Annotation and Taxonomy section with the following information: two authors with backgrounds in NLP and empirical software engineering independently annotated all 662 threads; disagreements were resolved via discussion involving a third author; the three-level taxonomy was developed iteratively through pilot coding of 50 threads followed by refinement; and inter-rater reliability was measured at Cohen’s kappa = 0.81 (substantial agreement). We also include a brief description of the coding scheme and examples of category assignment. These additions provide the requested transparency and reinforce the robustness of the taxonomy and the prominence rankings derived from it. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical reporting of annotated discussion data

full rationale

This is an empirical study that collects 662 discussion threads from 38 models on Hugging Face, manually annotates them, and reports observed patterns in user concerns via a three-level taxonomy. No equations, derivations, fitted parameters, predictions, or self-citation chains exist that could reduce the central claims to inputs by construction. The analysis directly reflects the annotated data without any load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that Hugging Face discussion threads provide an unbiased window into diverse user perceptions; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Hugging Face discussion threads from the chosen 38 models represent diverse, experience-driven user perspectives on LLMs and MLLMs
    The study uses this to generalize its taxonomy and prominence rankings beyond the sampled threads.

pith-pipeline@v0.9.0 · 5522 in / 1315 out tokens · 54128 ms · 2026-05-10T19:07:13.630544+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

81 extracted references · 15 canonical work pages · 4 internal anchors

  1. [1]

    Replication Package

    2026. Replication Package. https://doi.org/10.6084/m9.figshare.31898476

  2. [2]

    Adem Ait, Javier Luis Cánovas Izquierdo, and Jordi Cabot. 2025. On the suitability of hugging face hub for empirical studies.Empirical Software Engineering30, 2 (2025), 57

  3. [3]

    Adem Ait, Javier Luis Cánovas Izquierdo, and Jordi Cabot. 2024. HFCommunity: An extraction process and relational database to analyze Hugging Face Hub data. Science of Computer Programming234 (2024), 103079

  4. [4]

    Adekunle Ajibode, Abdul Ali Bangash, Filipe R Cogo, Bram Adams, and Ahmed E Hassan. 2025. Towards semantic versioning of open pre-trained language model releases on hugging face.Empirical Software Engineering30, 3 (2025), 78

  5. [5]

    Afnan A Al-Subaihin, Federica Sarro, Sue Black, Licia Capra, and Mark Harman

  6. [6]

    App store effects on software engineering practices.IEEE Transactions on Software Engineering47, 2 (2019), 300–319

  7. [7]

    Furkan Alaca and Paul C Van Oorschot. 2020. Comparative analysis and frame- work evaluating web single sign-on systems.ACM Computing Surveys (CSUR) 53, 5 (2020), 1–34

  8. [8]

    C Michael Barton, Allen Lee, Marco A Janssen, Sander van der Leeuw, Gregory E Tucker, Cheryl Porter, Joshua Greenberg, Laura Swantek, Karin Frank, Min Chen, et al. 2022. How to make models more useful.Proceedings of the National Academy of Sciences119, 35 (2022), e2202112119

  9. [9]

    Margherita Bernabei, Silvia Colabianchi, Andrea Falegnami, and Francesco Costantino. 2023. Students’ use of large language models in engineering educa- tion: A case study on technology acceptance, perceptions, efficacy, and detection chances.Computers and Education: Artificial Intelligence5 (2023), 100172

  10. [10]

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al

  11. [11]

    Paper Review:’Sparks of Artificial General Intelligence: Early experiments with GPT-4’. (2023)

  12. [12]

    Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara. 2024. The revolution of multimodal large language models: A survey.Findings of the association for computational linguistics: ACL 2024(2024), 13590–13618

  13. [13]

    Joel Castaño, Silverio Martínez-Fernández, and Xavier Franch. 2024. Lessons learned from mining the hugging face repository. InProceedings of the 1st IEEE/ACM International Workshop on Methodological Issues with Empirical Studies in Software Engineering. 1–6

  14. [15]

    In2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)

    Exploring the carbon footprint of hugging face’s ml models: A repository mining study. In2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). IEEE, 1–12

  15. [16]

    Joel Castaño, Silverio Martínez-Fernández, Xavier Franch, and Justus Bogner

  16. [17]

    InProceedings of the 21st International Conference on Mining Software Repositories

    Analyzing the evolution and maintenance of ml models on hugging face. InProceedings of the 21st International Conference on Mining Software Repositories. 607–618

  17. [18]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374(2021)

  18. [19]

    Ning Chen, Jialiu Lin, Steven CH Hoi, Xiaokui Xiao, and Boshen Zhang. 2014. AR- miner: mining informative reviews for developers from mobile app marketplace. InProceedings of the 36th international conference on software engineering. 767– 778

  19. [20]

    Xiang Chen, Chaoyang Gao, Chunyang Chen, Guangbei Zhang, and Yong Liu

  20. [21]

    An empirical study on challenges for llm application developers.ACM Transactions on Software Engineering and Methodology34, 7 (2025), 1–37

  21. [22]

    Avishek Choudhury, Yeganeh Shahsavar, and Hamid Shamszare. 2025. User intent to use DeepSeek for health care purposes and their trust in the large language model: Multinational survey study.JMIR human factors12, 1 (2025), e72867

  22. [23]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Se- bastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of machine learning research24, 240 (2023), 1–113

  23. [24]

    William G Cochran. 1954. Some methods for strengthening the common 𝜒 2 tests.Biometrics10, 4 (1954), 417–451

  24. [25]

    Lettie Y Conrad and Virginia M Tucker. 2019. Making it tangible: hybrid card sorting within qualitative interviews.Journal of Documentation75, 2 (2019), 397–416

  25. [26]

    Erica Coppolillo, Federico Cinus, Marco Minici, Francesco Bonchi, and Giuseppe Manco. 2025. Engagement-driven content generation with large language models. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2. 369–379

  26. [27]

    Jacek Dąbrowski, Emmanuel Letier, Anna Perini, and Angelo Susi. 2022. Analysing app reviews for software engineering: a systematic literature review. Empirical Software Engineering27, 2 (2022), 43

  27. [28]

    Vitor Mesaque Alves de Lima and Ricardo Marcondes Marcacini. 2024. Opinion mining for app reviews: Identifying and prioritizing emerging issues for software maintenance and evolution. InProceedings of the XXIII Brazilian Symposium on Software Quality. 687–696

  28. [29]

    Fahimeh Ebrahimi and Anas Mahmoud. 2022. Unsupervised summarization of privacy concerns in mobile application reviews. InProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–12

  29. [30]

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323(2022)

  30. [31]

    GGML Team. 2023. GGUF Specification. https://github.com/ggml-org/ggml/blob/ master/docs/gguf.md. Accessed: 2026-03-16

  31. [32]

    2023.Survey reveals AI’s impact on the developer experience

    GitHub. 2023.Survey reveals AI’s impact on the developer experience. GitHub Blog. https://github.blog/news-insights/research/survey-reveals-ais-impact-on- the-developer-experience/ Accessed: 2026-03-16

  32. [33]

    Alexandra González, Xavier Franch, David Lo, and Silverio Martínez-Fernández

  33. [34]

    Cataloguing Hugging Face Models to Software Engineering Activities: Automation and Findings.arXiv preprint arXiv:2506.03013(2025)

  34. [35]

    Emitza Guzman and Walid Maalej. 2014. How do users like this feature? a fine grained sentiment analysis of app reviews. In2014 IEEE 22nd international requirements engineering conference (RE). Ieee, 153–162

  35. [36]

    Hideaki Hata, Christoph Treude, Raula Gaikovina Kula, and Takashi Ishio. 2019. 9.6 million links in source code comments: Purpose, evolution, and decay. In2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 1211–1221

  36. [37]

    Xinyi Hou, Jiahao Han, Yanjie Zhao, and Haoyu Wang. 2025. Unveiling the Landscape of LLM Deployment in the Wild: An Empirical Study.arXiv preprint arXiv:2505.02502(2025)

  37. [38]

    Ye Hu and Wang Haofan. 2023. IP-Adapter-FaceID: Face ID Adapter for Text-to- Image Diffusion Models. https://huggingface.co/h94/IP-Adapter-FaceID. Ac- cessed: 2026-03-16

  38. [39]

    Kaiyu Huang, Fengran Mo, Xinyu Zhang, Hongliang Li, You Li, Yuanchi Zhang, Weijian Yi, Yulong Mao, Jinchen Liu, Yuzhuang Xu, et al. 2024. A survey on large language models with multilingualism: Recent advances and new frontiers.arXiv preprint arXiv:2405.10936(2024)

  39. [40]

    Nargiz Humbatova, Gunel Jahangirova, Gabriele Bavota, Vincenzo Riccio, An- drea Stocco, and Paolo Tonella. 2020. Taxonomy of real faults in deep learning systems. InProceedings of the ACM/IEEE 42nd international conference on software engineering. 1110–1121

  40. [41]

    Nishant Jha and Anas Mahmoud. 2019. Mining non-functional requirements from app store reviews.Empirical Software Engineering24, 6 (2019), 3659–3695

  41. [42]

    Wenxin Jiang, Nicholas Synovic, Matt Hyatt, Taylor R Schorlemmer, Rohan Sethi, Yung-Hsiang Lu, George K Thiruvathukal, and James C Davis. 2023. An empirical study of pre-trained model reuse in the hugging face deep learning model registry. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2463–2475

  42. [43]

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770(2023). Conference’17, July 2017, Washington, DC, USA anonymity

  43. [44]

    Adhishree Kathikar, Aishwarya Nair, Ben Lazarine, Agrim Sachdeva, and Sagar Samtani. 2023. Assessing the vulnerabilities of the open-source artificial intelli- gence (ai) landscape: A large-scale analysis of the hugging face platform. In2023 IEEE International Conference on Intelligence and Security Informatics (ISI). IEEE, 1–6

  44. [45]

    Krishnaveni Katta. 2025. Analyzing user perceptions of large language mod- els (Llms) on Reddit: sentiment and topic modeling of Chatgpt and DeepSeek discussions.arXiv preprint arXiv:2502.18513(2025)

  45. [46]

    Zijad Kurtanović and Walid Maalej. 2017. Mining user rationale from software reviews. In2017 IEEE 25th international requirements engineering conference (RE). IEEE, 61–70

  46. [47]

    Wen Lai, Mohsen Mesgar, and Alexander Fraser. 2024. LLMs beyond English: Scal- ing the multilingual capability of LLMs with cross-lingual feedback. InFindings of the Association for Computational Linguistics: ACL 2024. 8186–8213

  47. [48]

    Benjamin Laufer, Hamidah Oderinwale, and Jon Kleinberg. 2025. Anatomy of a machine learning ecosystem: 2 million models on hugging face.arXiv preprint arXiv:2508.06811(2025)

  48. [49]

    Jenny T Liang, Chenyang Yang, and Brad A Myers. 2024. A large-scale survey on the usability of ai programming assistants: Successes and challenges. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–13

  49. [50]

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michi- hiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al

  50. [51]

    Holistic evaluation of language models.arXiv preprint arXiv:2211.09110 (2022)

  51. [52]

    Xun Liang, Hanyu Wang, Yezhaohui Wang, Shichao Song, Jiawei Yang, Simin Niu, Jie Hu, Dan Liu, Shunyu Yao, Feiyu Xiong, et al . 2024. Controllable text generation for large language models: A survey.arXiv preprint arXiv:2408.12599 (2024)

  52. [53]

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. Awq: Activation-aware weight quantization for on-device llm compression and accel- eration.Proceedings of machine learning and systems6 (2024), 87–100

  53. [54]

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al . 2024. Mmbench: Is your multi-modal model an all-around player?. InEuropean conference on computer vision. Springer, 216–233

  54. [55]

    Yunbo Lyu, Hong Jin Kang, Ratnadira Widyasari, Julia Lawall, and David Lo

  55. [56]

    IEEE Transactions on Software Engineering50, 9 (2024), 2219–2239

    Evaluating SZZ Implementations: An Empirical Study on the Linux Kernel. IEEE Transactions on Software Engineering50, 9 (2024), 2219–2239

  56. [57]

    My productivity is boosted, but

    Yunbo Lyu, Zhou Yang, Jieke Shi, Jianming Chang, Yue Liu, and David Lo. 2025. " My productivity is boosted, but. . . " Demystifying Users’ Perception on AI Coding Assistants. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 191–203

  57. [58]

    Walid Maalej, Zijad Kurtanović, Hadeer Nabil, and Christoph Stanik. 2016. On the automatic classification of app reviews.Requirements Engineering21, 3 (2016), 311–331

  58. [59]

    Gary Marcus. 2020. The next decade in AI: four steps towards robust artificial intelligence.arXiv preprint arXiv:2002.06177(2020)

  59. [60]

    Andrew M McNutt, Chenglong Wang, Robert A Deline, and Steven M Drucker

  60. [61]

    InProceedings of the 2023 CHI conference on human factors in computing systems

    On the design of ai-powered code assistants for notebooks. InProceedings of the 2023 CHI conference on human factors in computing systems. 1–16

  61. [62]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback.Advances in neural information processing systems35 (2022), 27730–27744

  62. [63]

    Dennis Pagano and Walid Maalej. 2013. User feedback in the appstore: An empirical study. In2013 21st IEEE international requirements engineering conference (RE). IEEE, 125–134

  63. [64]

    Karl Pearson. 1900. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling.The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science50, 302 (1900), 157–175

  64. [65]

    Perplexity AI. 2025. r1-1776: A DeepSeek-R1-based model for enhanced reasoning. https://huggingface.co/perplexity-ai/r1-1776. Accessed: 2026-03-16

  65. [66]

    Minh Vu Phong, Tam The Nguyen, Hung Viet Pham, and Tung Thanh Nguyen

  66. [67]

    In2015 30th IEEE/ACM International Conference on Automated Software Engi- neering (ASE)

    Mining user opinions in mobile app reviews: A keyword-based approach (t). In2015 30th IEEE/ACM International Conference on Automated Software Engi- neering (ASE). IEEE, 749–759

  67. [68]

    Rethish Nair Rajendran, Sathish Krishna Anumula, Dileep Kumar Rai, and Sachin Agrawal. 2025. Zero Trust Security Model Implementation in Microservices Architectures Using Identity Federation.arXiv preprint arXiv:2511.04925(2025)

  68. [69]

    Agnia Sergeyuk, Yaroslav Golubev, Timofey Bryksin, and Iftekhar Ahmed. 2025. Using AI-based coding assistants in practice: State of affairs, perceptions, and ways forward.Information and Software Technology178 (2025), 107610

  69. [70]

    Donald Sharpe. 2015. Your chi-square test is statistically significant: now what?. Practical assessment, research & evaluation20, 8 (2015), n8

  70. [71]

    Aakash Sorathiya and Gouri Ginde. 2024. Towards extracting ethical concerns- related software requirements from app reviews. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 2251– 2255

  71. [72]

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al . 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.Transactions on machine learning research(2023)

  72. [73]

    Muhammad Asif Suryani, Saurav Karmakar, Brigitte Mathiak, and Philipp Mayr

  73. [74]

    InProceedings of the 14th International Conference on Data Science, Technology and Applications

    Model card metadata collection from hugging face to foster multidisci- plinary ai research: A dataset. InProceedings of the 14th International Conference on Data Science, Technology and Applications. 583–590

  74. [75]

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. 2024. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9568–9578

  75. [76]

    Chaozheng Wang, Junhao Hu, Cuiyun Gao, Yu Jin, Tao Xie, Hailiang Huang, Zhenyu Lei, and Yuetang Deng. 2023. How practitioners expect code completion?. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1294–1306

  76. [77]

    Jiayin Wang, Weizhi Ma, Peijie Sun, Min Zhang, and Jian-Yun Nie. 2024. Under- standing user experience in large language model interactions.arXiv preprint arXiv:2401.08329(2024)

  77. [78]

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. 2024. A survey on multimodal large language models.National Science Review11, 12 (2024), nwae403

  78. [79]

    Guangba Yu, Zirui Wang, Yujie Huang, Renyi Zhong, Yuedong Zhong, Yilun Wang, and Michael R Lyu. 2026. Why Does the LLM Stop Computing: An Empirical Study of User-Reported Failures in Open-Source LLMs.arXiv preprint arXiv:2601.13655(2026)

  79. [80]

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. 2024. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9556–9567

  80. [81]

    Huan Zhang, Wei Cheng, Yuhan Wu, and Wei Hu. 2024. A pair programming framework for code generation via multi-plan exploration and feedback-driven refinement. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 1319–1331

Showing first 80 references.