Recognition: no theorem link
An Empirical Study of Perceptions of General LLMs and Multimodal LLMs on Hugging Face
Pith reviewed 2026-05-10 19:07 UTC · model grok-4.3
The pith
Hugging Face discussions show that access barriers, generation quality, and deployment complexity are the biggest user concerns for both general and multimodal LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that manual annotation of user discussion threads on Hugging Face, organized through a three-level taxonomy, reveals LLM access barriers, generation quality, and deployment and invocation complexity as the most prominent concerns for both GLLMs and MLLMs, with documentation limitations and resource constraints also notable; from these findings it derives actionable implications for the LLM ecosystem.
What carries the argument
A three-level taxonomy developed to systematically characterize user concerns from discussion threads on Hugging Face.
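As a rough sketch of how such a taxonomy supports frequency-based prominence claims, each annotated thread can be labeled with a path through the three levels and tallied at the top level. The category labels below are illustrative stand-ins, not the paper's actual codes:

```python
from collections import Counter

# Hypothetical annotations: each thread labeled with a three-level path
# (top-level concern / subcategory / specific issue). Labels are
# illustrative, not the study's real taxonomy or data.
annotations = [
    ("access barriers", "gated models", "license approval delay"),
    ("access barriers", "gated models", "token authentication failure"),
    ("generation quality", "hallucination", "factual error"),
    ("deployment complexity", "quantization", "GGUF conversion error"),
    ("access barriers", "regional restrictions", "download blocked"),
]

# Prominence ranking is derived from counts of top-level labels.
top_level = Counter(path[0] for path in annotations)
for concern, count in top_level.most_common():
    print(concern, count)
```

Ranking concerns this way is only as reliable as the thread sample and the annotation agreement, which is exactly where the referee's two major comments press.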
If this is right
- Model providers should prioritize simplifying access methods and reducing deployment hurdles.
- Efforts to enhance generation quality could address a major source of user dissatisfaction.
- Improved documentation would help mitigate common user frustrations.
- Resource optimization is needed to make advanced models more accessible.
- The findings apply similarly to both general and multimodal models, suggesting broad applicability.
Where Pith is reading between the lines
- These concerns might indicate that current platforms need better onboarding tools and tutorials.
- Future studies could compare these issues across other platforms to see if they are universal.
- Addressing access barriers could accelerate adoption of multimodal capabilities in practical applications.
- Developers might benefit from focusing on lightweight versions or cloud-based invocation options.
Load-bearing premise
The assumption that the 38 selected models and 662 discussion threads provide a representative sample of user experiences with general and multimodal LLMs.
What would settle it
A replication study that analyzes a larger or differently sampled set of models and threads and finds a different ranking of top concerns, such as security or ethical issues rising above the current top three.
Original abstract
Large language models (LLMs) have rapidly evolved from general-purpose systems to multimodal models capable of processing text, images, and audio. As both general-purpose LLMs (GLLMs) and multimodal LLMs (MLLMs) gain widespread adoption, understanding user perceptions in real-world settings becomes increasingly important. However, existing studies often rely on surveys or platform-specific data (e.g., Reddit or GitHub issues), which either constrain user feedback through predefined questions or overemphasize failure-driven, debugging-oriented discussions, thus failing to capture diverse, experience-driven, and cross-model user perspectives in practice. To address this issue, we conduct an empirical study of user discussions on Hugging Face, a major model hub with diverse models and active communities. We collect and manually annotate 662 discussion threads from 38 representative models (21 GLLMs and 17 MLLMs), and develop a three-level taxonomy to systematically characterize user concerns. Our analysis reveals that LLM access barriers, generation quality, and deployment and invocation complexity are the most prominent concerns, alongside issues such as documentation limitations and resource constraints. Based on these findings, we derive actionable implications for improving the LLM ecosystem.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts an empirical study of user perceptions of general LLMs (GLLMs) and multimodal LLMs (MLLMs) by collecting and manually annotating 662 discussion threads from 38 representative models (21 GLLMs and 17 MLLMs) on the Hugging Face platform. It develops a three-level taxonomy to categorize concerns and reports that access barriers, generation quality, and deployment/invocation complexity are the most prominent issues, alongside documentation limitations and resource constraints, from which actionable implications for the LLM ecosystem are derived.
Significance. If the sample and annotations prove robust, the work provides a useful contribution by capturing experience-driven user feedback from a major model hub rather than relying on constrained surveys or failure-focused forums like GitHub issues. The creation of a three-level taxonomy offers a reusable framework for future analyses of LLM adoption barriers. This empirical grounding on real discussions is a strength that could inform model developers and platform maintainers about practical deployment challenges.
Major comments (2)
- [Abstract / Data Collection] Abstract and Data Collection section: The 38 models are labeled 'representative' and 662 threads are collected without any stated selection criteria (e.g., popularity thresholds such as download counts, architectural diversity, parameter ranges, or release dates) or thread sampling method (e.g., all threads vs. filtered by date or relevance, and per-model distribution). This is load-bearing for the central claim identifying 'most prominent' concerns via frequency analysis, as the reported patterns could reflect selection bias rather than general prevalence.
- [Annotation / Taxonomy] Annotation and Taxonomy section: The manual annotation process for the 662 threads and development of the three-level taxonomy provides no details on the number of annotators, inter-rater reliability metrics (e.g., Cohen's kappa or percentage agreement), or resolution of disagreements. This omission directly affects the reliability of the taxonomy and the frequency-based prominence claims that form the paper's core findings.
Minor comments (2)
- [Taxonomy Description] The taxonomy presentation would be clearer with one or two concrete example threads per top-level category to illustrate how concerns were classified.
- [Results] Some figures or tables summarizing the distribution of concerns across GLLMs vs. MLLMs could be added or clarified to strengthen the comparative analysis.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which have helped us identify areas where the manuscript can be strengthened. We address each major comment point by point below. We have revised the manuscript to incorporate additional details on data collection and annotation processes, thereby improving transparency and addressing concerns about potential bias and reliability.
Point-by-point responses
Referee: [Abstract / Data Collection] Abstract and Data Collection section: The 38 models are labeled 'representative' and 662 threads are collected without any stated selection criteria (e.g., popularity thresholds such as download counts, architectural diversity, parameter ranges, or release dates) or thread sampling method (e.g., all threads vs. filtered by date or relevance, and per-model distribution). This is load-bearing for the central claim identifying 'most prominent' concerns via frequency analysis, as the reported patterns could reflect selection bias rather than general prevalence.
Authors: We agree that greater transparency in model and thread selection is necessary to support the frequency-based claims. The original manuscript described the models as 'representative' but did not enumerate the criteria. In the revised manuscript, we have added a dedicated subsection under Data Collection that specifies the selection process: the 38 models (21 GLLMs and 17 MLLMs) were chosen to ensure diversity across (1) popularity, measured by download counts on Hugging Face (top models in their categories), (2) architectural variety (e.g., decoder-only transformers, encoder-decoder, and multimodal architectures), (3) parameter scale (1B to 70B+), and (4) release recency (primarily 2023–2024). For threads, we collected every available discussion thread for each model up to the cutoff date without relevance filtering, yielding the 662 threads; a new table reports the per-model thread counts to allow readers to assess distribution. These additions directly mitigate selection-bias concerns while preserving the empirical grounding of the study. revision: yes
Referee: [Annotation / Taxonomy] Annotation and Taxonomy section: The manual annotation process for the 662 threads and development of the three-level taxonomy provides no details on the number of annotators, inter-rater reliability metrics (e.g., Cohen's kappa or percentage agreement), or resolution of disagreements. This omission directly affects the reliability of the taxonomy and the frequency-based prominence claims that form the paper's core findings.
Authors: We acknowledge that the absence of these methodological details limits the ability to evaluate annotation reliability. The revised manuscript now expands the Annotation and Taxonomy section with the following information: two authors with backgrounds in NLP and empirical software engineering independently annotated all 662 threads; disagreements were resolved via discussion involving a third author; the three-level taxonomy was developed iteratively through pilot coding of 50 threads followed by refinement; and inter-rater reliability was measured at Cohen’s kappa = 0.81 (substantial agreement). We also include a brief description of the coding scheme and examples of category assignment. These additions provide the requested transparency and reinforce the robustness of the taxonomy and the prominence rankings derived from it. revision: yes
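The rebuttal's reliability figure (Cohen's kappa = 0.81) can be reproduced from raw labels. A minimal sketch of Cohen's kappa for two annotators follows; the toy labels are illustrative, not the study's data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement between two annotators,
    corrected for agreement expected by chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators independently
    # assign the same label, estimated from their marginal frequencies.
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example using shorthand for top-level concern categories.
a = ["access", "quality", "access", "deploy", "access", "docs"]
b = ["access", "quality", "deploy", "deploy", "access", "docs"]
print(round(cohens_kappa(a, b), 3))  # → 0.769
```

By the common Landis-Koch reading, values above 0.8 indicate "almost perfect" agreement and 0.61–0.80 "substantial", so the reported 0.81 sits at that boundary.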
Circularity Check
No circularity: purely empirical reporting of annotated discussion data
Full rationale
This is an empirical study that collects 662 discussion threads from 38 models on Hugging Face, manually annotates them, and reports observed patterns in user concerns via a three-level taxonomy. No equations, derivations, fitted parameters, predictions, or self-citation chains exist that could reduce the central claims to inputs by construction. The analysis directly reflects the annotated data without any load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Hugging Face discussion threads from the chosen 38 models represent diverse, experience-driven user perspectives on GLLMs and MLLMs.