arxiv: 2310.11113 · v3 · submitted 2023-10-17 · 💻 cs.SE

Revisiting Sentiment Analysis for Software Engineering in the Era of Large Language Models

Pith reviewed 2026-05-24 06:10 UTC · model grok-4.3

classification 💻 cs.SE
keywords sentiment analysis · software engineering · large language models · zero-shot learning · few-shot learning · fine-tuning · imbalanced datasets · user feedback

The pith

Bigger large language models outperform fine-tuned smaller ones on software engineering sentiment tasks with limited or imbalanced data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether bigger large language models can solve the labeled data shortage that limits smaller models when analyzing sentiments in software engineering texts such as user reviews, comments, and forum posts. It runs three open-source bigger models in zero-shot and few-shot modes and pits them against fine-tuned smaller models on five established datasets. The results show bigger models reach top performance on small or skewed datasets and work well with no training examples at all. Smaller models still win when training data is plentiful and the classes are balanced. A reader would care because sentiment signals from stakeholders directly inform software maintenance and feature decisions, so the right model choice changes how much manual labeling effort is required.

Core claim

Our experimental findings demonstrate that bLLMs exhibit state-of-the-art performance on datasets marked by limited training data and imbalanced distributions. bLLMs can also achieve excellent performance under a zero-shot setting. However, when ample training data is available or the dataset exhibits a more balanced distribution, fine-tuned sLLMs can still achieve superior results.

What carries the argument

Direct comparison of bigger LLMs in zero-shot and few-shot settings against fine-tuned smaller LLMs on five software engineering sentiment datasets.

If this is right

  • Bigger LLMs set the performance standard for sentiment analysis on small or imbalanced software engineering datasets.
  • Zero-shot use of bigger LLMs becomes a practical option for these tasks without any labeled examples.
  • Fine-tuned smaller LLMs retain the advantage when training data is both abundant and balanced.
  • Practitioners should choose the model type according to the volume and balance of available data rather than defaulting to one approach.
  • Reliance on large-scale manual labeling for software engineering sentiment tasks can be reduced in many common cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams could begin with zero-shot bigger LLMs for immediate feedback on new platforms and later fine-tune smaller models once labeled data accumulates.
  • The same data-volume and balance logic may apply to other software engineering classification tasks such as bug severity or requirement prioritization.
  • Results on proprietary or streaming software engineering data would provide a stronger test than the five public datasets alone.
  • Variations in prompting or example selection could further improve bigger LLM results without additional training data.
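
Zero-shot and few-shot prompting differ only in whether labeled demonstrations are placed before the query, which is what makes the variations in the last bullet cheap to try. The following is a minimal sketch under stated assumptions: the instruction wording, the layout, the label set, and the helper name `build_prompt` are all hypothetical and are not the templates shown in the paper's Figures 1 and 2.

```python
# Illustrative only: the instruction text and layout are assumptions,
# not the prompt templates used in the paper.
LABELS = ("positive", "neutral", "negative")


def build_prompt(text, demonstrations=()):
    """Return a zero-shot prompt, or a k-shot prompt when `demonstrations`
    (a sequence of (sentence, label) pairs) is non-empty."""
    lines = [
        "Classify the sentiment of the sentence as one of: "
        + ", ".join(LABELS) + ".",
        "",
    ]
    # Each demonstration is a solved example placed before the query.
    for sentence, label in demonstrations:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        lines += [f"Sentence: {sentence}", f"Sentiment: {label}", ""]
    # The query is left open so the model completes the label.
    lines += [f"Sentence: {text}", "Sentiment:"]
    return "\n".join(lines)


zero_shot = build_prompt("This API is a nightmare to configure.")
one_shot = build_prompt(
    "This API is a nightmare to configure.",
    demonstrations=[("Thanks, the patch works great!", "positive")],
)
```

Keeping the template in one function makes it straightforward to vary the wording or the choice of demonstrations while holding everything else fixed.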

Load-bearing premise

The five established datasets are representative of real-world software engineering sentiment tasks and the zero-shot, few-shot, and fine-tuning setups for bigger and smaller LLMs are implemented under comparable conditions without undisclosed advantages.

What would settle it

A controlled test on a sixth large balanced software engineering sentiment dataset in which fine-tuned smaller LLMs fail to outperform the bigger LLMs, or in which the bigger LLM results show clear but hidden implementation advantages.

Figures

Figures reproduced from arXiv: 2310.11113 by David Lo, Ferdian Thung, Ivana Clairine Irsan, Ting Zhang.

Figure 1: The zero-shot prompt templates we utilized when running
Figure 2: Few-shot prompt template (with 𝑘 = 1) utilized by Llama 2-Chat [60].
Figure 3: One example to get the prediction probability scores from the bLLMs.
Figure 4: Sensitivity of different prompt designs. The circles depicted in the figure represent outlier data points.
Figure 5: Comparison of the highest macro-F1 and micro-F1 scores achieved through zero-shot learning and
Figure 6: Performance variance of all the models on each dataset.
Figure 7: The Venn diagram of the correct predictions made by bLLMs.
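
Figure 5 reports both macro-F1 and micro-F1, and the two can diverge sharply on imbalanced data. As a hedged illustration (plain Python, with made-up toy labels rather than anything from the paper's datasets), the sketch below computes both for single-label multi-class predictions:

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy imbalanced example (made-up data): 8 neutral, 2 negative,
# and a classifier that always answers "neutral".
y_true = ["neutral"] * 8 + ["negative"] * 2
y_pred = ["neutral"] * 10
macro, micro = f1_scores(y_true, y_pred)
```

In this toy case micro-F1 equals accuracy (0.8), while macro-F1 drops to about 0.44 because the minority class is never predicted. This is why macro-F1 is the more revealing number on skewed datasets.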
Abstract

Software development involves collaborative interactions where stakeholders express opinions across various platforms. Recognizing the sentiments conveyed in these interactions is crucial for the effective development and ongoing maintenance of software systems. For software products, analyzing the sentiment of user feedback, e.g., reviews, comments, and forum posts can provide valuable insights into user satisfaction and areas for improvement. This can guide the development of future updates and features. However, accurately identifying sentiments in software engineering datasets remains challenging. This study investigates bigger large language models (bLLMs) in addressing the labeled data shortage that hampers fine-tuned smaller large language models (sLLMs) in software engineering tasks. We conduct a comprehensive empirical study using five established datasets to assess three open-source bLLMs in zero-shot and few-shot scenarios. Additionally, we compare them with fine-tuned sLLMs, using sLLMs to learn contextual embeddings of text from software platforms. Our experimental findings demonstrate that bLLMs exhibit state-of-the-art performance on datasets marked by limited training data and imbalanced distributions. bLLMs can also achieve excellent performance under a zero-shot setting. However, when ample training data is available or the dataset exhibits a more balanced distribution, fine-tuned sLLMs can still achieve superior results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents an empirical study comparing three open-source bigger large language models (bLLMs) in zero-shot and few-shot settings against fine-tuned smaller large language models (sLLMs) for sentiment analysis on five established software engineering datasets. It claims that bLLMs achieve state-of-the-art performance on datasets with limited training data and imbalanced distributions (including zero-shot), while fine-tuned sLLMs outperform when ample balanced training data is available.

Significance. If the experimental protocols are shown to be comparable, the results would provide actionable guidance for SE practitioners facing labeled-data scarcity in sentiment tasks, clarifying when scale (bLLMs) versus task-specific fine-tuning (sLLMs) is preferable.

major comments (3)
  1. [§4] §4 (Experimental Setup): The exact prompt templates, example-selection criteria for few-shot, and output-parsing rules for the bLLM zero/few-shot evaluations are not reported. This detail is load-bearing for the central claim, because any dataset-specific phrasing or post-processing could confer an undisclosed advantage relative to the standardized fine-tuning protocol used for sLLMs.
  2. [Table 1, §5.1] Table 1 (Dataset Characteristics) and §5.1 (Results per Dataset): Training-set sizes, class-balance ratios, and the precise train/test splits are not tabulated or statistically tested. Without these, it is impossible to verify the headline distinction that bLLMs win on “limited training data and imbalanced distributions” while sLLMs win on “ample balanced data.”
  3. [§5.2] §5.2 (Comparison Protocol): No description is given of whether identical preprocessing, tokenization, or label-mapping pipelines were applied to both model families, nor whether any post-hoc filtering of predictions occurred. This omission directly affects the validity of attributing performance gaps to model scale alone.
minor comments (2)
  1. [Abstract, §3] The abstract and §3 refer to “five established datasets” without naming them until later; an early table listing the datasets, their sources, and sizes would improve readability.
  2. [Figures] Figure captions and axis labels in the performance plots use inconsistent abbreviations for model names; standardize to the nomenclature introduced in §4.
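
Major comment 2 asks for training-set sizes and class-balance ratios. A sketch of how such a summary could be computed from a list of training labels follows. The function name, the choice of imbalance ratio (largest class count divided by smallest), and the sample labels are illustrative assumptions, not statistics from the paper.

```python
from collections import Counter


def class_balance(labels):
    """Summarize size, per-class proportions, and a simple imbalance ratio."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    total = sum(counts.values())
    proportions = {cls: n / total for cls, n in counts.items()}
    # Ratio of the most frequent class to the least frequent one;
    # 1.0 means perfectly balanced, larger means more skewed.
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return {
        "size": total,
        "proportions": proportions,
        "imbalance_ratio": imbalance_ratio,
    }


# Made-up labels for illustration only.
summary = class_balance(["neutral"] * 6 + ["positive"] * 3 + ["negative"] * 1)
```

Reporting these three numbers per dataset would let a reader check which datasets fall in the "limited and imbalanced" regime and which in the "ample and balanced" one.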

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of reproducibility and experimental transparency. We address each major comment below and will revise the manuscript to incorporate the requested details.

Point-by-point responses
  1. Referee: [§4] §4 (Experimental Setup): The exact prompt templates, example-selection criteria for few-shot, and output-parsing rules for the bLLM zero/few-shot evaluations are not reported. This detail is load-bearing for the central claim, because any dataset-specific phrasing or post-processing could confer an undisclosed advantage relative to the standardized fine-tuning protocol used for sLLMs.

    Authors: We agree that these implementation details are critical for reproducibility and for validating the central claims. In the revised manuscript, we will expand §4 to include the exact prompt templates used for each bLLM in both zero-shot and few-shot settings, the criteria for selecting few-shot examples (including whether selection was random, stratified by class, or based on other heuristics), and the precise rules for parsing model outputs into sentiment labels. These additions will enable direct comparison with the sLLM fine-tuning protocol. revision: yes

  2. Referee: [Table 1, §5.1] Table 1 (Dataset Characteristics) and §5.1 (Results per Dataset): Training-set sizes, class-balance ratios, and the precise train/test splits are not tabulated or statistically tested. Without these, it is impossible to verify the headline distinction that bLLMs win on “limited training data and imbalanced distributions” while sLLMs win on “ample balanced data.”

    Authors: We acknowledge that the current Table 1 and §5.1 lack sufficient quantitative detail on dataset characteristics. We will revise Table 1 to report training-set sizes, class-balance ratios (e.g., proportions of positive/negative/neutral), and the exact train/test split ratios or indices used for each of the five datasets. We will also add a short statistical summary or discussion in §5.1 to support the distinction between limited/imbalanced versus ample/balanced data regimes. revision: yes

  3. Referee: [§5.2] §5.2 (Comparison Protocol): No description is given of whether identical preprocessing, tokenization, or label-mapping pipelines were applied to both model families, nor whether any post-hoc filtering of predictions occurred. This omission directly affects the validity of attributing performance gaps to model scale alone.

    Authors: We will clarify the comparison protocol in the revised §5.2. Identical preprocessing steps (text normalization, removal of URLs/special characters, and lowercasing where applicable) and the same label-mapping rules were applied to both bLLM and sLLM pipelines; we will state this explicitly. Tokenization is inherently model-specific and will be noted as such. No post-hoc filtering or selective discarding of predictions was performed on either model family. These details will be added to ensure the attribution of performance differences is transparent. revision: yes
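
The output-parsing rules raised in response 1 matter because a generative model returns free text rather than a class index. One possible rule is sketched below under stated assumptions: the label set, the first-match policy, and the `None` fallback are hypothetical and are not the authors' procedure.

```python
import re

LABELS = ("positive", "negative", "neutral")
_PATTERN = re.compile(r"\b(" + "|".join(LABELS) + r")\b", re.IGNORECASE)


def parse_label(generation):
    """Return the first label word found in the model output, lowercased,
    or None when no label word appears.

    Unparseable outputs are returned as None rather than dropped, so the
    caller can count them as errors instead of silently filtering them."""
    match = _PATTERN.search(generation)
    return match.group(1).lower() if match else None


examples = [
    "Sentiment: Negative.",
    "POSITIVE, definitely not negative",
    "I cannot tell from this sentence.",
]
parsed = [parse_label(e) for e in examples]
```

A first-match rule is only one option. Whatever rule is used, stating it explicitly lets readers judge whether parsing could favor one model family.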

Circularity Check

0 steps flagged

No circularity: pure empirical comparison on fixed public datasets

full rationale

The paper performs an empirical evaluation of bLLMs vs. sLLMs on five established public datasets using zero-shot, few-shot, and fine-tuning protocols. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All performance claims are measured directly against held-out labels, satisfying the self-contained benchmark criterion. No load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Empirical study with no mathematical derivations. Relies on standard machine-learning evaluation assumptions such as dataset representativeness and fair experimental conditions.

axioms (2)
  • domain assumption The five established datasets are representative of software engineering sentiment analysis tasks.
    Central to generalizing the performance claims beyond the specific collections tested.
  • domain assumption Zero-shot and few-shot prompting for bLLMs and fine-tuning procedures for sLLMs are implemented without hidden advantages or inconsistent preprocessing.
    Required for the direct performance comparison to be valid.

pith-pipeline@v0.9.0 · 5759 in / 1339 out tokens · 28745 ms · 2026-05-24T06:10:21.499409+00:00 · methodology

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 10 internal anchors

  1. [1]

    Toufique Ahmed, Amiangshu Bosu, Anindya Iqbal, and Shahram Rahimi. 2017. SentiCR: A customized sentiment analysis tool for code review interactions. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 106–111

  2. [2]

    Toufique Ahmed, Kunal Suresh Pai, Premkumar Devanbu, and Earl Barr. 2024. Automatic semantic augmentation of language model prompts (for code summarization). In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

  3. [3]

    Eeshita Biswas, Mehmet Efruz Karabulut, Lori Pollock, and K Vijay-Shanker. 2020. Achieving reliable sentiment analysis in the software engineering domain using bert. In 2020 IEEE International conference on software maintenance and evolution (ICSME). IEEE, 162–173

  4. [4]

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)

  5. [5]

    Leo Breiman. 2001. Random forests. Machine learning 45 (2001), 5–32

  6. [6]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901

  7. [7]

    Fabio Calefato, Filippo Lanubile, Federico Maiorano, and Nicole Novielli. 2018. Sentiment polarity detection for software development. In Proceedings of the 40th International Conference on Software Engineering. 128–128. J. ACM, Vol. 37, No. 4, Article 1. Publication date: September 2024.

  8. [8]

    Fabio Calefato, Filippo Lanubile, Nicole Novielli, and Luigi Quaranta. 2019. Emtk-the emotion mining toolkit. In 2019 IEEE/ACM 4th International Workshop on Emotion Awareness in Software Engineering (SEmotion). IEEE, 34–37

  9. [9]

    Ning Chen, Steven CH Hoi, Shaohua Li, and Xiaokui Xiao. 2015. SimApp: A framework for detecting similar mobile applications by online kernel learning. In Proceedings of the eighth ACM international conference on web search and data mining. 305–314

  10. [10]

    Zhenpeng Chen, Yanbin Cao, Xuan Lu, Qiaozhu Mei, and Xuanzhe Liu. 2019. Sentimoji: an emoji-powered learning approach for sentiment analysis in software engineering. In Proceedings of the 2019 27th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering. 841–852

  11. [11]

    Zhenpeng Chen, Yanbin Cao, Huihan Yao, Xuan Lu, Xin Peng, Hong Mei, and Xuanzhe Liu. 2021. Emoji-powered sentiment and emotion detection from software developers’ communication data. ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 2 (2021), 1–48

  12. [12]

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/

  13. [13]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022)

  14. [14]

    Xiang Deng, Vasilisa Bashlovkina, Feng Han, Simon Baumgartner, and Michael Bendersky. 2023. LLMs to the Moon? Reddit Market Sentiment Analysis with Large Language Models. In Companion Proceedings of the ACM Web Conference

  15. [15]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT. 4171–4186

  16. [16]

    Zulfadzli Drus and Haliyana Khalid. 2019. Sentiment analysis in social media and its application: Systematic literature review. Procedia Computer Science 161 (2019), 707–714

  17. [17]

    Cunxiao Du, Jing Jiang, Xu Yuanchen, Jiawei Wu, Sicheng Yu, Yongqi Li, Shenggui Li, Kai Xu, Liqiang Nie, Zhaopeng Tu, et al. [n. d.]. GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding. In Forty-first International Conference on Machine Learning

  18. [18]

    Cunxiao Du, Zhaopeng Tu, and Jing Jiang. 2021. Order-agnostic cross entropy for non-autoregressive machine translation. In International conference on machine learning. PMLR, 2849–2859

  19. [19]

    Tom Fawcett. 2006. An introduction to ROC analysis. Pattern recognition letters 27, 8 (2006), 861–874

  20. [20]

    Sidong Feng and Chunyang Chen. 2024. Prompting Is All You Need: Automated Android Bug Replay with Large Language Models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–13

  21. [21]

    Mingyang Geng, Shangwen Wang, Dezun Dong, Haotian Wang, Ge Li, Zhi Jin, Xiaoguang Mao, and Xiangke Liao. Large language models are few-shot summarizers: Multi-intent comment generation via in-context learning. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–13

  23. [23]

    Mia Mohammad Imran, Yashasvi Jain, Preetha Chatterjee, and Kostadin Damevski. 2022. Data Augmentation for Improving Emotion Recognition in Software Engineering Communication. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–13

  24. [24]

    Ivana Clairine Irsan, Ting Zhang, Ferdian Thung, Kisub Kim, and David Lo. 2023. Multi-modal api recommendation. In 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 272–283

  25. [25]

    Md Rakibul Islam, Md Kauser Ahmmed, and Minhaz F Zibran. 2019. MarValous: Machine learning based detection of emotions in the valence-arousal space in software engineering text. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. 1786–1793

  26. [26]

    Md Rakibul Islam and Minhaz F Zibran. 2018. DEVA: sensing emotions in the valence arousal space in software engineering text. In Proceedings of the 33rd annual ACM symposium on applied computing. 1536–1543

  27. [27]

    Md Rakibul Islam and Minhaz F Zibran. 2018. SentiStrength-SE: Exploiting domain specificity for improved sentiment analysis in software engineering text. Journal of Systems and Software 145 (2018), 125–146

  28. [28]

    Robbert Jongeling, Subhajit Datta, and Alexander Serebrenik. 2015. Choosing your weapons: On sentiment analysis tools for software engineering research. In 2015 IEEE international conference on software maintenance and evolution (ICSME). IEEE, 531–535

  29. [29]

    Robbert Jongeling, Proshanta Sarkar, Subhajit Datta, and Alexander Serebrenik. 2017. On negative results when using sentiment analysis tools for software engineering research. Empirical Software Engineering 22 (2017), 2543–2584

  30. [30]

    Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. (2020)

  31. [31]

    Bin Lin, Nathan Cassee, Alexander Serebrenik, Gabriele Bavota, Nicole Novielli, and Michele Lanza. 2022. Opinion mining for software development: a systematic literature review. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 3 (2022), 1–41

  32. [32]

    Bin Lin, Fiorella Zampetti, Gabriele Bavota, Massimiliano Di Penta, Michele Lanza, and Rocco Oliveto. 2018. Sentiment analysis for software engineering: How far can we go?. In Proceedings of the 40th international conference on software engineering. 94–104

  33. [33]

    Bing Liu. 2020. Sentiment analysis: Mining opinions, sentiments, and emotions. Cambridge university press

  34. [34]

    Bing Liu et al. 2010. Sentiment analysis and subjectivity. Handbook of natural language processing 2, 2010 (2010), 627–666

  35. [35]

    Bing Liu and Lei Zhang. 2012. A survey of opinion mining and sentiment analysis. In Mining text data. Springer, 415–463

  36. [36]

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

  37. [37]

    Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 8086–8098

  38. [38]

    Zeyang Ma, An Ran Chen, Dong Jae Kim, Tse-Hsun Chen, and Shaowei Wang. 2024. Llmparser: An exploratory study on using large language models for log parsing. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

  39. [39]

    Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 11048–11064

  40. [40]

    Moran Mizrahi, Guy Kaplan, Dan Malkin, Rotem Dror, Dafna Shahaf, and Gabriel Stanovsky. 2023. State of what art? a call for multi-prompt llm evaluation. arXiv preprint arXiv:2401.00595 (2023)

  41. [41]

    Alessandro Murgia, Marco Ortu, Parastou Tourani, Bram Adams, and Serge Demeyer. 2018. An exploratory qualitative and quantitative analysis of emotions in issue report comments of open source systems. Empirical Software Engineering 23 (2018), 521–564

  42. [42]

    Nicole Novielli, Fabio Calefato, Davide Dongiovanni, Daniela Girardi, and Filippo Lanubile. 2020. Can we use se-specific sentiment analysis tools in a cross-platform setting?. In Proceedings of the 17th International Conference on Mining Software Repositories. 158–168

  43. [43]

    Nicole Novielli, Fabio Calefato, Filippo Lanubile, and Alexander Serebrenik. 2021. Assessment of off-the-shelf SE-specific sentiment analysis tools: An extended replication study. Empirical Software Engineering 26, 4 (2021), 77

  44. [44]

    Nicole Novielli, Daniela Girardi, and Filippo Lanubile. 2018. A benchmark study on sentiment analysis for software engineering research. In Proceedings of the 15th International Conference on Mining Software Repositories. 364–375

  45. [45]

    Martin Obaidi, Lukas Nagel, Alexander Specht, and Jil Klünder. 2022. Sentiment analysis tools in software engineering: A systematic mapping study. Information and Software Technology (2022), 107018

  46. [46]

    Marco Ortu, Alessandro Murgia, Giuseppe Destefanis, Parastou Tourani, Roberto Tonelli, Michele Marchesi, and Bram Adams. 2016. The emotional side of software developers in JIRA. In Proceedings of the 13th international conference on mining software repositories. 480–483

  47. [47]

    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830

  48. [48]

    Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models. Advances in neural information processing systems 34 (2021), 11054–11070

  49. [49]

    Anuja Priyam, Gupta R Abhijeeta, Anju Rathee, and Saurabh Srivastava. 2013. Comparative analysis of decision tree classification algorithms. International Journal of current engineering and technology 3, 2 (2013), 334–337

  50. [50]

    Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems. 1–7

  51. [51]

    Irina Rish et al. 2001. An empirical study of the naive Bayes classifier. In IJCAI 2001 workshop on empirical methods in artificial intelligence, Vol. 3. 41–46

  52. [52]

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)

  53. [53]

    Klaus R Scherer, Tanja Wranik, Janique Sangsue, Véronique Tran, and Ursula Scherer. 2004. Emotions in everyday life: Probability of occurrence, risk factors, appraisal and reaction patterns. Social Science Information 43, 4 (2004), 499–570

  54. [54]

    Anna Schmidt and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the fifth international workshop on natural language processing for social media. 1–10

  55. [55]

    Noam Shazeer. 2020. Glu variants improve transformer. arXiv preprint arXiv:2002.05202 (2020)

  56. [56]

    Nan Song, Hongjie Cai, Rui Xia, Jianfei Yu, Zhen Wu, and Xinyu Dai. 2023. A Sequence-to-Structure Approach to Document-level Targeted Sentiment Analysis. In Findings of the Association for Computational Linguistics: EMNLP 2023

  57. [57]

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2021. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864 (2021)

  58. [58]

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca

  59. [59]

    Parastou Tourani, Yujuan Jiang, and Bram Adams. 2014. Monitoring sentiment in open source mailing lists: exploratory study on the apache ecosystem. In CASCON, Vol. 14. 34–44

  60. [60]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  61. [61]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  62. [62]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)

  63. [63]

    Lorenzo Villarroel, Gabriele Bavota, Barbara Russo, Rocco Oliveto, and Massimiliano Di Penta. 2016. Release planning of mobile apps based on user reviews. In Proceedings of the 38th International Conference on Software Engineering. 14–24

  64. [64]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations. 38–45

  65. [65]

    Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023)

  66. [66]

    Junjielong Xu, Ziang Cui, Yuan Zhao, Xu Zhang, Shilin He, Pinjia He, Liqun Li, Yu Kang, Qingwei Lin, Yingnong Dang, et al. 2024. UniLog: Automatic Logging via LLM and In-Context Learning. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–12

  67. [67]

    Yiyan Xu, Wenjie Wang, Fuli Feng, Yunshan Ma, Jizhi Zhang, and Xiangnan He. 2024. Diffusion Models for Generative Outfit Recommendation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1350–1359

  68. [68]

    Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems 32 (2019)

  69. [69]

    Peiyan Zhang, Yuchen Yan, Xi Zhang, Liying Kang, Chaozhuo Li, Feiran Huang, Senzhang Wang, and Sunghun Kim. GPT4Rec: Graph Prompt Tuning for Streaming Recommendation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1774–1784

  71. [71]

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022)

  72. [72]

    Ting Zhang, Divya Prabha Chandrasekaran, Ferdian Thung, and David Lo. 2022. Benchmarking library recognition in tweets. In Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension. 343–353

  73. [73]

    Ting Zhang, Bowen Xu, Ferdian Thung, Stefanus Agus Haryono, David Lo, and Lingxiao Jiang. 2020. Sentiment analysis for software engineering: How far can pre-trained transformer models go?. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 70–80

  74. [74]

    Wenxuan Zhang, Yue Deng, Bing Liu, Sinno Jialin Pan, and Lidong Bing. 2023. Sentiment Analysis in the Era of Large Language Models: A Reality Check. arXiv preprint arXiv:2305.15005 (2023)

  75. [75]

    Wenxuan Zhang, Xin Li, Yang Deng, Lidong Bing, and Wai Lam. 2022. A survey on aspect-based sentiment analysis: Tasks, methods, and challenges. IEEE Transactions on Knowledge and Data Engineering (2022)

  76. [76]

    Yingying Zhang and Daqing Hou. 2013. Extracting problematic API features from forum discussions. In 2013 21st International Conference on Program Comprehension (ICPC). IEEE, 142–151

  77. [77]

    Xin Zhou, Ting Zhang, and David Lo. 2024. Large Language Model for Vulnerability Detection: Emerging Results and Future Directions. In Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results (Lisbon, Portugal) (ICSE-NIER’24). Association for Computing Machinery, New York, NY, USA, 47–51. https...