Revisiting Sentiment Analysis for Software Engineering in the Era of Large Language Models
Pith reviewed 2026-05-24 06:10 UTC · model grok-4.3
The pith
Bigger large language models outperform fine-tuned smaller ones on software engineering sentiment tasks with limited or imbalanced data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our experimental findings demonstrate that bLLMs exhibit state-of-the-art performance on datasets marked by limited training data and imbalanced distributions. bLLMs can also achieve excellent performance under a zero-shot setting. However, when ample training data is available or the dataset exhibits a more balanced distribution, fine-tuned sLLMs can still achieve superior results.
What carries the argument
Direct comparison of bigger LLMs in zero-shot and few-shot settings against fine-tuned smaller LLMs on five software engineering sentiment datasets.
If this is right
- Bigger LLMs set the performance standard for sentiment analysis on small or imbalanced software engineering datasets.
- Zero-shot use of bigger LLMs becomes a practical option for these tasks without any labeled examples.
- Fine-tuned smaller LLMs retain the advantage when training data is both abundant and balanced.
- Practitioners should choose the model type according to the volume and balance of available data rather than defaulting to one approach.
- Reliance on large-scale manual labeling for software engineering sentiment tasks can be reduced in many common cases.
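The volume-and-balance criterion in the list above can be made concrete with a small helper. This is a minimal sketch, not the paper's procedure: the function names, the use of normalized entropy as a balance score, and both thresholds (`min_examples`, `min_balance`) are hypothetical choices that a team would have to calibrate on its own data.

```python
from collections import Counter
from math import log


def normalized_entropy(labels):
    """Balance score in [0, 1]: 1.0 means equal class counts, 0.0 means one class."""
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * log(c / total) for c in counts.values())
    return entropy / log(len(counts))


def suggest_approach(labels, min_examples=1000, min_balance=0.9):
    """Suggest a starting point from labeled-data volume and class balance.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    ample = len(labels) >= min_examples
    balanced = normalized_entropy(labels) >= min_balance
    if ample and balanced:
        return "fine-tune a smaller LLM"
    if not ample and not balanced:
        return "prompt a bigger LLM"
    return "evaluate both"  # mixed regime: the summary above does not decide it


# Toy label list (invented): small and heavily skewed toward one class.
small_skewed = ["neutral"] * 180 + ["negative"] * 15 + ["positive"] * 5
print(suggest_approach(small_skewed))
```

The three-way return value is deliberate: when data is ample but skewed, or balanced but scarce, the sketch declines to pick a side and recommends measuring both approaches.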
Where Pith is reading between the lines
- Teams could begin with zero-shot bigger LLMs for immediate feedback on new platforms and later fine-tune smaller models once labeled data accumulates.
- The same data-volume and balance logic may apply to other software engineering classification tasks such as bug severity or requirement prioritization.
- Results on proprietary or streaming software engineering data would provide a stronger test than the five public datasets alone.
- Variations in prompting or example selection could further improve bigger LLM results without additional training data.
Load-bearing premise
The five established datasets are representative of real-world software engineering sentiment tasks and the zero-shot, few-shot, and fine-tuning setups for bigger and smaller LLMs are implemented under comparable conditions without undisclosed advantages.
What would settle it
A controlled test on a sixth, large and balanced software engineering sentiment dataset in which fine-tuned smaller LLMs fail to outperform the bigger LLMs, or an audit showing that the bigger LLM results depended on undisclosed implementation advantages.
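One way to decide whether one model family really outperforms the other on such a dataset is a paired test over per-item correctness. The sketch below implements an exact two-sided McNemar test with the standard library only. It is offered as an assumption about how the comparison could be run, not as the paper's evaluation protocol, and the function name and toy labels are hypothetical.

```python
from math import comb


def mcnemar_exact(gold, pred_a, pred_b):
    """Exact two-sided McNemar test on items where exactly one model is correct.

    Returns (only_a_correct, only_b_correct, p_value).
    """
    only_a = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a == g and b != g)
    only_b = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if b == g and a != g)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0  # the two models never disagree in correctness
    # Binomial tail under the null hypothesis that either model is equally
    # likely to be the one that is correct on a discordant item.
    tail = sum(comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Toy predictions (invented): model A is right on all ten items, model B on none.
gold = ["positive"] * 10
print(mcnemar_exact(gold, ["positive"] * 10, ["negative"] * 10))
```

A small p-value only says the two models differ on this test set; it does not rule out differences in prompts or preprocessing as the cause.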
Original abstract
Software development involves collaborative interactions where stakeholders express opinions across various platforms. Recognizing the sentiments conveyed in these interactions is crucial for the effective development and ongoing maintenance of software systems. For software products, analyzing the sentiment of user feedback, e.g., reviews, comments, and forum posts can provide valuable insights into user satisfaction and areas for improvement. This can guide the development of future updates and features. However, accurately identifying sentiments in software engineering datasets remains challenging. This study investigates bigger large language models (bLLMs) in addressing the labeled data shortage that hampers fine-tuned smaller large language models (sLLMs) in software engineering tasks. We conduct a comprehensive empirical study using five established datasets to assess three open-source bLLMs in zero-shot and few-shot scenarios. Additionally, we compare them with fine-tuned sLLMs, using sLLMs to learn contextual embeddings of text from software platforms. Our experimental findings demonstrate that bLLMs exhibit state-of-the-art performance on datasets marked by limited training data and imbalanced distributions. bLLMs can also achieve excellent performance under a zero-shot setting. However, when ample training data is available or the dataset exhibits a more balanced distribution, fine-tuned sLLMs can still achieve superior results.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an empirical study comparing three open-source bigger large language models (bLLMs) in zero-shot and few-shot settings against fine-tuned smaller large language models (sLLMs) for sentiment analysis on five established software engineering datasets. It claims that bLLMs achieve state-of-the-art performance on datasets with limited training data and imbalanced distributions, and perform well even in the zero-shot setting, while fine-tuned sLLMs outperform when ample balanced training data is available.
Significance. If the experimental protocols are shown to be comparable, the results would provide actionable guidance for SE practitioners facing labeled-data scarcity in sentiment tasks, clarifying when scale (bLLMs) versus task-specific fine-tuning (sLLMs) is preferable.
major comments (3)
- [§4] §4 (Experimental Setup): The exact prompt templates, example-selection criteria for few-shot, and output-parsing rules for the bLLM zero/few-shot evaluations are not reported. This detail is load-bearing for the central claim, because any dataset-specific phrasing or post-processing could confer an undisclosed advantage relative to the standardized fine-tuning protocol used for sLLMs.
- [Table 1, §5.1] Table 1 (Dataset Characteristics) and §5.1 (Results per Dataset): Training-set sizes, class-balance ratios, and the precise train/test splits are not tabulated or statistically tested. Without these, it is impossible to verify the headline distinction that bLLMs win on “limited training data and imbalanced distributions” while sLLMs win on “ample balanced data.”
- [§5.2] §5.2 (Comparison Protocol): No description is given of whether identical preprocessing, tokenization, or label-mapping pipelines were applied to both model families, nor whether any post-hoc filtering of predictions occurred. This omission directly affects the validity of attributing performance gaps to model scale alone.
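The dataset statistics requested in the second major comment are inexpensive to produce. The sketch below shows one way to make a seeded, class-stratified train/test split and to tabulate split sizes, class proportions, and the majority-to-minority ratio. The helper names, the 70/30 ratio, the seed, and the toy labels are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.3, seed=42):
    """Return (train_indices, test_indices), splitting each class separately."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)  # fixed seed so the split can be reported and reproduced
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def describe(labels, indices):
    """Size, per-class proportions, and majority/minority ratio for one split."""
    counts = Counter(labels[i] for i in indices)
    total = sum(counts.values())
    if total == 0:
        return {"size": 0, "proportions": {}, "imbalance_ratio": None}
    proportions = {label: count / total for label, count in sorted(counts.items())}
    ratio = max(counts.values()) / min(counts.values())
    return {"size": total, "proportions": proportions, "imbalance_ratio": ratio}


# Toy labels (invented) standing in for one dataset.
labels = ["neutral"] * 70 + ["negative"] * 20 + ["positive"] * 10
train_idx, test_idx = stratified_split(labels)
print(describe(labels, train_idx))
print(describe(labels, test_idx))
```

Publishing the seed or the index lists alongside such a table would let readers check which datasets actually fall into the limited or imbalanced regime.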
minor comments (2)
- [Abstract, §3] The abstract and §3 refer to “five established datasets” without naming them until later; an early table listing the datasets, their sources, and sizes would improve readability.
- [Figures] Figure captions and axis labels in the performance plots use inconsistent abbreviations for model names; standardize to the nomenclature introduced in §4.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of reproducibility and experimental transparency. We address each major comment below and will revise the manuscript to incorporate the requested details.
- Referee: [§4] §4 (Experimental Setup): The exact prompt templates, example-selection criteria for few-shot, and output-parsing rules for the bLLM zero/few-shot evaluations are not reported. This detail is load-bearing for the central claim, because any dataset-specific phrasing or post-processing could confer an undisclosed advantage relative to the standardized fine-tuning protocol used for sLLMs.
  Authors: We agree that these implementation details are critical for reproducibility and for validating the central claims. In the revised manuscript, we will expand §4 to include the exact prompt templates used for each bLLM in both zero-shot and few-shot settings, the criteria for selecting few-shot examples (including whether selection was random, stratified by class, or based on other heuristics), and the precise rules for parsing model outputs into sentiment labels. These additions will enable direct comparison with the sLLM fine-tuning protocol.
  Revision: yes
- Referee: [Table 1, §5.1] Table 1 (Dataset Characteristics) and §5.1 (Results per Dataset): Training-set sizes, class-balance ratios, and the precise train/test splits are not tabulated or statistically tested. Without these, it is impossible to verify the headline distinction that bLLMs win on “limited training data and imbalanced distributions” while sLLMs win on “ample balanced data.”
  Authors: We acknowledge that the current Table 1 and §5.1 lack sufficient quantitative detail on dataset characteristics. We will revise Table 1 to report training-set sizes, class-balance ratios (e.g., proportions of positive/negative/neutral), and the exact train/test split ratios or indices used for each of the five datasets. We will also add a short statistical summary or discussion in §5.1 to support the distinction between limited/imbalanced versus ample/balanced data regimes.
  Revision: yes
- Referee: [§5.2] §5.2 (Comparison Protocol): No description is given of whether identical preprocessing, tokenization, or label-mapping pipelines were applied to both model families, nor whether any post-hoc filtering of predictions occurred. This omission directly affects the validity of attributing performance gaps to model scale alone.
  Authors: We will clarify the comparison protocol in the revised §5.2. Identical preprocessing steps (text normalization, removal of URLs/special characters, and lowercasing where applicable) and the same label-mapping rules were applied to both bLLM and sLLM pipelines; we will state this explicitly. Tokenization is inherently model-specific and will be noted as such. No post-hoc filtering or selective discarding of predictions was performed on either model family. These details will be added to ensure the attribution of performance differences is transparent.
  Revision: yes
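To illustrate the kind of detail the rebuttal promises for §4, here is a minimal sketch of a zero-shot/few-shot prompt builder, class-stratified example selection, and a strict output parser. Everything in it is an assumption made for illustration: the template wording, the label set, the function names, and the `None` fallback for unparseable outputs are hypothetical and do not reproduce the paper's prompts or parsing rules. No model is called.

```python
import random
import re

LABELS = ("positive", "negative", "neutral")  # assumed label set

# Hypothetical template wording, not the paper's prompt.
INSTRUCTION = (
    "Classify the sentiment of the following software engineering text "
    "as positive, negative, or neutral. Answer with one word."
)


def select_examples(pool, per_class=1, seed=0):
    """Pick the same number of demonstrations per class from (text, label) pairs."""
    rng = random.Random(seed)
    chosen = []
    for label in LABELS:
        candidates = [pair for pair in pool if pair[1] == label]
        chosen.extend(rng.sample(candidates, min(per_class, len(candidates))))
    return chosen


def build_prompt(text, examples=()):
    """Zero-shot when `examples` is empty, few-shot otherwise."""
    parts = [INSTRUCTION]
    for example_text, example_label in examples:
        parts.append(f"Text: {example_text}\nSentiment: {example_label}")
    parts.append(f"Text: {text}\nSentiment:")
    return "\n\n".join(parts)


def parse_label(output):
    """Map raw model output to a label; None if zero or several labels appear."""
    matches = re.findall(r"\b(positive|negative|neutral)\b", output, re.IGNORECASE)
    found = {match.lower() for match in matches}
    return found.pop() if len(found) == 1 else None


# Toy demonstration pool (invented sentences).
pool = [
    ("Great fix, thanks!", "positive"),
    ("This API is a nightmare.", "negative"),
    ("Updated the version number.", "neutral"),
]
print(build_prompt("The build fails again.", select_examples(pool)))
```

Whatever parser is used, reporting how unparseable outputs are scored (for example, counted as errors rather than dropped) keeps the comparison free of post-hoc filtering.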
Circularity Check
No circularity: pure empirical comparison on fixed public datasets
The paper performs an empirical evaluation of bLLMs vs. sLLMs on five established public datasets using zero-shot, few-shot, and fine-tuning protocols. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All performance claims are measured directly against held-out labels, satisfying the self-contained benchmark criterion. No load-bearing step reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The five established datasets are representative of software engineering sentiment analysis tasks.
- domain assumption: Zero-shot and few-shot prompting for bLLMs and fine-tuning procedures for sLLMs are implemented without hidden advantages or inconsistent preprocessing.