pith. machine review for the scientific record.

arxiv: 2604.10834 · v1 · submitted 2026-04-12 · 💻 cs.SE · cs.AI

Recognition: unknown

LLMs for Qualitative Data Analysis Fail on Security-specific Comments in Human Experiments

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:05 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords large language models · qualitative data analysis · security analysis · thematic coding · human experiments · annotation · Cohen's Kappa · code vulnerabilities

The pith

LLMs cannot reliably annotate security-specific comments from human code analysis experiments

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can replace human experts when coding free-text comments from experiments in which subjects review vulnerable code snippets. Four leading LLMs were prompted to detect nine security-relevant codes such as mentions of code identifiers, lines of code, and security keywords. Outputs were scored against human annotations with Cohen's Kappa under several prompt styles that included detailed codebooks and examples. Marked gains appeared only with detailed code descriptions, yet these gains varied by code and stayed too low for LLMs to stand in for human annotators. The results indicate that current models lack the contextual understanding required for technical security qualitative tasks.

Core claim

The authors find that large language models produce annotations for security aspects in experiment comments that match human experts only modestly, with clear but non-uniform gains from detailed code descriptions that nevertheless remain insufficient to replace human annotators as measured by Cohen's Kappa.

What carries the argument

Prompt variations that add detailed code descriptions and examples, evaluated by Cohen's Kappa agreement with human annotations on nine security codes extracted from free-text comments.
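
As a concrete illustration of that comparison, the sketch below computes per-code Cohen's Kappa between human and LLM binary annotations. The code names, label vectors, and use of scikit-learn are illustrative assumptions, not the paper's actual data or pipeline.

```python
# Illustrative sketch only: per-code agreement between human and LLM annotations.
# The code names and label arrays below are hypothetical, not the paper's data.
from sklearn.metrics import cohen_kappa_score

# One binary label per comment: 1 = code present, 0 = absent.
human = {
    "mentions_code_identifier": [1, 0, 1, 1, 0, 0, 1, 0],
    "mentions_line_of_code":    [0, 0, 1, 0, 0, 1, 0, 0],
    "uses_security_keyword":    [1, 1, 0, 1, 0, 0, 1, 1],
}
llm = {
    "mentions_code_identifier": [1, 0, 0, 1, 0, 1, 1, 0],
    "mentions_line_of_code":    [0, 1, 1, 0, 0, 1, 0, 0],
    "uses_security_keyword":    [1, 1, 0, 0, 0, 0, 1, 0],
}

# Chance-corrected agreement, reported separately for each code, as in the paper.
for code in human:
    kappa = cohen_kappa_score(human[code], llm[code])
    print(f"{code}: kappa = {kappa:.2f}")
```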

Load-bearing premise

The nine security-relevant codes and the chosen human annotations form a sufficient test of LLM capability for qualitative security analysis tasks in general.

What would settle it

An experiment in which one or more LLMs achieve high and uniform Cohen's Kappa scores with humans across all nine codes under comparable prompting conditions.
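
A minimal reading of that settling condition, with an assumed numeric threshold for "high" agreement (the paper does not fix a cutoff):

```python
# Illustrative uniformity check: every one of the nine codes must clear the bar,
# not just the average. The 0.8 threshold is an assumption, not from the paper.
def settles_it(per_code_kappa: dict[str, float], threshold: float = 0.8) -> bool:
    return len(per_code_kappa) == 9 and min(per_code_kappa.values()) >= threshold
```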

Figures

Figures reproduced from arXiv: 2604.10834 by Fabio Massacci, Maria Camporese, Yuanjun Gong.

Figure 1: Annotation by human annotators and LLMs. (a) The annotators get an initial understanding from the background information and the codebook. The annotators individually mark a large fraction of their comments. (b) At this point, a formal definition of the codebook is agreed (columns 1 and 2 of Table 1). (2) Completed annotation with codebook and first review: (a) The annotators get a deeper understanding of t…
Figure 3: The defined prompt P3+ based on P3_pilot. Pilot Experiment: For the exploration of the initial design of the prompts, we employed OpenAI's GPT-4o model [18] (interactive version) to conduct the pilot study for prompt refinement. We chose GPT-4o since this model is not used in the main experiments and can therefore reduce bias. The output of each turn from the pilot LLM is manually reviewed for insig…
Figure 2: The pilot interaction where P3_pilot is constructed on other comments. Few-shot examples are considered a significant help for the LLMs, but they are not considered a good practice for human annotators, as they can be a source of bias and lack of independence. P3_pilot: The prompt asks for conflicting examples and provides clarifications from the researcher. P3_pilot asks the pilot LLM to go through the huma…
Figure 4: Average Cohen's κ (chance-corrected accuracy if humans were ground truth) vs. effort (size of prompts). (a) GPT-5 (b) Claude-4 (c) DeepSeek-V3.2 (d) Qwen3-Max.
Figure 5: Chance-corrected accuracy against the required effort in tokens, with each curve corresponding to a different code.
read the original abstract

[Background:] Thematic analysis of free-text justifications in human experiments provides significant qualitative insights. Yet, it is costly because reliable annotations require multiple domain experts. Large language models (LLMs) seem ideal candidates to replace human annotators. [Problem:] Coding security-specific aspects (code identifiers mentioned, lines-of-code mentioned, security keywords mentioned) may require deeper contextual understanding than sentiment classification. [Objective:] Explore whether LLMs can act as automated annotators for technical security comments by human subjects. [Method:] We prompt four top-performing LLMs on LiveBench to detect nine security-relevant codes in free-text comments by human subjects analyzing vulnerable code snippets. Outputs are compared to human annotators using Cohen's Kappa (chance-corrected accuracy). We test different prompts mimicking annotation best practices, including emerging codes, detailed codebooks with examples, and conflicting examples. [Negative Results:] We observed marked improvements only when using detailed code descriptions; however, these improvements are not uniform across codes and are insufficient to reliably replace a human annotator. [Limitations:] Additional studies with more LLMs and annotation tasks are needed.
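
The paper's prompt texts are not reproduced on this page; the sketch below only illustrates the general shape of a detailed-codebook prompt of the kind the abstract describes. The code names, descriptions, and output convention are hypothetical.

```python
# Hypothetical sketch of a detailed-codebook prompt; wording and codes are
# placeholders, not the paper's actual P1-P3+ prompts.
codebook = {
    "mentions_code_identifier": "The comment names a variable, function, or class from the snippet.",
    "mentions_line_of_code": "The comment refers to a specific line or range of lines.",
    "uses_security_keyword": "The comment uses a security term such as overflow, injection, or sanitization.",
}

def build_prompt(comment: str) -> str:
    lines = ["You annotate free-text comments written by humans reviewing vulnerable code."]
    lines.append("For each code below, answer 1 if it applies to the comment and 0 otherwise.")
    for name, description in codebook.items():
        lines.append(f"- {name}: {description}")
    lines.append(f'Comment: "{comment}"')
    lines.append("Answer with one `code_name: 0/1` pair per line.")
    return "\n".join(lines)

print(build_prompt("The strcpy on line 12 can overflow buf."))
```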

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript examines whether LLMs can serve as automated annotators for nine security-relevant codes in free-text comments from human subjects analyzing vulnerable code snippets. It evaluates four top LLMs using Cohen's Kappa against human annotations, testing prompting strategies that include detailed codebooks, examples, and conflicting cases. The central finding is that detailed code descriptions yield marked but non-uniform improvements, which remain insufficient to reliably replace human annotators.

Significance. If the results hold, the work supplies concrete empirical evidence on the limits of LLMs for domain-specific qualitative coding tasks in security research, where contextual understanding of code identifiers, lines, and keywords is required. The study is strengthened by its use of multiple LLMs, systematic variation of prompts that mimic annotation best practices, and direct chance-corrected comparison to independent human labels.

major comments (2)
  1. The claim that improvements 'are insufficient to reliably replace a human annotator' (abstract and conclusion) is load-bearing yet unanchored. No human inter-annotator agreement figures (e.g., pairwise Cohen's Kappa among the human coders) are reported, nor is any explicit threshold or parity criterion defined for 'reliable' replacement. Without this benchmark, the sufficiency judgment cannot be evaluated quantitatively.
  2. Method section: the number of free-text comments, number of human subjects, and total annotation instances are not stated in the provided abstract or method summary. These quantities are required to assess the statistical power of the reported Kappa values and the robustness of the negative result across codes.
minor comments (2)
  1. Abstract: the four LLMs are described only as 'top-performing on LiveBench'; naming them and providing at least one full prompt template would improve reproducibility.
  2. The limitations paragraph could explicitly note the absence of a human-human baseline as a direction for follow-up work.

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of our work. We address each major comment below.

read point-by-point responses
  1. Referee: The claim that improvements 'are insufficient to reliably replace a human annotator' (abstract and conclusion) is load-bearing yet unanchored. No human inter-annotator agreement figures (e.g., pairwise Cohen's Kappa among the human coders) are reported, nor is any explicit threshold or parity criterion defined for 'reliable' replacement. Without this benchmark, the sufficiency judgment cannot be evaluated quantitatively.

    Authors: We agree that the claim requires better anchoring with an explicit criterion. In the revision we will define 'reliable replacement' using standard interpretations of Cohen's Kappa (substantial agreement at Kappa ≥ 0.61; see the interpretation sketch after these responses). We will also add a limitations paragraph stating that human inter-annotator agreement was not computed, as the study used expert annotations without pairwise independent coding. These changes will be made in the abstract, conclusion, and a new limitations subsection. revision: yes

  2. Referee: Method section: the number of free-text comments, number of human subjects, and total annotation instances are not stated in the provided abstract or method summary. These quantities are required to assess the statistical power of the reported Kappa values and the robustness of the negative result across codes.

    Authors: We thank the referee for noting this clarity issue. The full Method section already contains these quantities. In the revised manuscript we will add an explicit summary paragraph at the start of the Method section stating the number of human subjects, free-text comments, and total annotation instances to facilitate evaluation of statistical power and robustness. revision: yes
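
For context on the criterion proposed in response 1, the sketch below maps a Kappa value to the conventional Landis and Koch interpretation bands; the band labels and the 0.61 cutoff are standard conventions, not values reported in the paper.

```python
# Conventional Landis-Koch bands for Cohen's kappa; 0.61 is the start of the
# "substantial agreement" band that the rebuttal proposes as its replacement criterion.
def interpret_kappa(kappa: float) -> str:
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

def reliable_replacement(kappa: float, threshold: float = 0.61) -> bool:
    return kappa >= threshold
```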

standing simulated objections not resolved
  • Human inter-annotator agreement figures (pairwise Cohen's Kappa) are not available, as the original study design did not include multiple independent annotations per comment for IAA calculation.

Circularity Check

0 steps flagged

No circularity: pure empirical comparison to independent human annotations

full rationale

The paper conducts an empirical study: it prompts four LLMs on security comment data, applies different prompt variants (codebooks, examples), and measures agreement with separate human annotations via Cohen's Kappa. No equations, fitted parameters, derived predictions, or self-referential derivations appear. The central claim rests on observed performance differences against an external human benchmark, not on any reduction to the paper's own inputs or prior self-citations. This matches the default case of a self-contained empirical evaluation with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on treating human annotations as ground truth and Cohen's Kappa as the right agreement metric; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Human annotations provide the correct ground truth labels for security codes.
    Standard assumption in annotation studies where experts define the reference labels.
  • standard math Cohen's Kappa is an appropriate chance-corrected measure of agreement for this task.
    Widely accepted statistical tool for inter-rater reliability in qualitative coding; its definition is recalled below.
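
For completeness, the standard definition behind the second axiom, where p_o is the observed agreement and p_e is the agreement expected by chance:

```latex
% Cohen's kappa: observed agreement corrected for chance agreement.
\kappa = \frac{p_o - p_e}{1 - p_e}
```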

pith-pipeline@v0.9.0 · 5491 in / 1174 out tokens · 112896 ms · 2026-05-10T15:05:00.759229+00:00 · methodology


Reference graph

Works this paper leans on

63 extracted references · 11 canonical work pages · 6 internal anchors

  1. [1]

    Toufique Ahmed, Amiangshu Bosu, Anindya Iqbal, and Shahram Rahimi. 2017. SentiCR: A customized sentiment analysis tool for code review interactions. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 106–111

  2. [2]

    Anthropic. 2025. Claude 4: Claude Opus 4 & Claude Sonnet 4 Large Language Models. System card / technical documentation. https://www.anthropic.com/news/claude-4 Includes Opus 4 and Sonnet 4, released May 22, 2025

  3. [3]

    Andreas Armborst. 2017. Thematic proximity in content analysis. Sage Open 7, 2 (2017), 2158244017707797

  4. [4]

    Anthony G Barnston. 1992. Correspondence among the correlation, RMSE, and Heidke forecast verification measures; refinement of the Heidke score. Weather and Forecasting 7, 4 (1992), 699–709

  5. [5]

    Avik Bhattacharjee. 2025. Introduction to Amazon Bedrock. In A Practical Guide to Generative AI Using Amazon Bedrock: Building, Deploying, and Securing Generative AI Applications. Springer, 79–112

  6. [6]

    David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022

  7. [7]

    Fabio Calefato, Filippo Lanubile, Federico Maiorano, and Nicole Novielli. 2018. Sentiment polarity detection for software development. In Proceedings of the 40th International Conference on Software Engineering. 128–128

  8. [8]

    Fabio Calefato, Filippo Lanubile, and Nicole Novielli. 2017. EmoTxt: a toolkit for emotion recognition from text. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW). IEEE, 79–80

  9. [9]

    Victoria Clarke and Virginia Braun. 2017. Thematic analysis. The Journal of Positive Psychology 12, 3 (2017), 297–298

  10. [10]

    Guilherme A Maldonado da Cruz, Elisa Hatsue Moriya Huzita, and Valéria D Feltrim. 2016. Estimating Trust in Virtual Teams - A Framework based on Sentiment Analysis. In International Conference on Enterprise Information Systems, Vol. 2. SciTePress, 464–471

  11. [11]

    Shih-Chieh Dai, Aiping Xiong, and Lun-Wei Ku. 2023. LLM-in-the-loop: Leveraging large language model for thematic analysis. arXiv preprint arXiv:2310.15100 (2023)

  12. [12]

    DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Ha...

  13. [13]

    Zackary Okun Dunivin. 2024. Scalable qualitative coding with LLMs: Chain-of-thought reasoning matches human performance in some hermeneutic tasks. arXiv preprint arXiv:2401.15170 (2024)

  14. [14]

    Denae Ford, Justin Smith, Philip J Guo, and Chris Parnin. 2016. Paradise unplugged: Identifying barriers for female participation on Stack Overflow. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. 846–857

  15. [15]

    Robert Wayne Gregory, Mark Keil, Jan Muntermann, and Magnus Mähring. 2015. Paradoxes and the nature of ambidexterity in IT transformation programs. Information Systems Research 26, 1 (2015), 57–80

  17. [17]

    Greg Guest, Kathleen M MacQueen, and Emily E Namey. 2011. Applied thematic analysis. Sage Publications

  18. [18]

    Wenbo Guo, Dongliang Mu, Jun Xu, Purui Su, G. Wang, and Xinyu Xing. 2018. LEMNA: Explaining Deep Learning based Security Applications. Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (2018)

  19. [19]

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. GPT-4o system card. arXiv preprint arXiv:2410.21276 (2024)

  20. [20]

    Md Rakibul Islam and Minhaz F Zibran. 2018. DEVA: sensing emotions in the valence arousal space in software engineering text. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing. 1536–1543

  21. [21]

    Nishant Jha and Anas Mahmoud. 2018. Using frame semantics for classifying and summarizing application store reviews. Empirical Software Engineering 23, 6 (2018), 3734–3767

  22. [22]

    Nishant Jha and Anas Mahmoud. 2019. Mining non-functional requirements from app store reviews. Empirical Software Engineering 24 (2019), 3659–3695

  23. [23]

    Jing Jiang, David Lo, Jiahuan He, Xin Xia, Pavneet Singh Kochhar, and Li Zhang. 2017. Why and how developers fork what from whom in GitHub. Empirical Software Engineering 22, 1 (2017), 547–578

  25. [25]

    Brittany Johnson, Rahul Pandita, Justin Smith, Denae Ford, Sarah Elder, Emerson Murphy-Hill, Sarah Heckman, and Caitlin Sadowski. 2016. A cross-tool communication study on program analysis tool notifications. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. 73–84

  26. [26]

    Robbert Jongeling, Proshanta Sarkar, Subhajit Datta, and Alexander Serebrenik. 2017. On negative results when using sentiment analysis tools for software engineering research. Empirical Software Engineering 22 (2017), 2543–2584

  28. [28]

    Rafael Kallis, Andrea Di Sorbo, Gerardo Canfora, and Sebastiano Panichella. 2019. Ticket tagger: Machine learning driven issue classification. In 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 406–409

  30. [30]

    Kevin Danis Li, Adrian M Fernandez, Rachel Schwartz, Natalie Rios, Marvin Nathaniel Carlisle, Gregory M Amend, Hiren V Patel, and Benjamin N Breyer. 2024. Comparing GPT-4 and human researchers in health care data analysis: qualitative description study. Journal of Medical Internet Research 26 (2024), e56500

  32. [32]

    Zhen Li, Deqing Zou, Shouhuai Xu, Xinyu Ou, Hai Jin, Sujuan Wang, Zhijun Deng, and Yuyi Zhong. 2018. VulDeePecker: A deep learning-based system for vulnerability detection. arXiv preprint arXiv:1801.01681 (2018)

  33. [33]

    Bin Lin, Nathan Cassee, Alexander Serebrenik, Gabriele Bavota, Nicole Novielli, and Michele Lanza. 2022. Opinion mining for software development: a systematic literature review. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 3 (2022), 1–41

  34. [34]

    Petter Mæhlum, David Samuel, Rebecka Maria Norman, Elma Jelin, Øyvind Andresen Bjertnæs, Lilja Øvrelid, and Erik Velldal. 2024. It's Difficult to be Neutral – Human and LLM-based Sentiment Annotation of Patient Comments. arXiv preprint arXiv:2404.18832 (2024)

  35. [35]

    Mika V Mäntylä and Casper Lassenius. 2008. What types of defects are really discovered in code reviews? IEEE Transactions on Software Engineering 35, 3 (2008), 430–448

  36. [36]

    Ggaliwango Marvin, Nakayiza Hellen, Daudi Jjingo, and Joyce Nakatumba-Nabende. 2023. Prompt engineering in large language models. In International Conference on Data Intelligence and Cognitive Informatics. Springer, 387–402

  37. [37]

    Meta AI. 2024. The LLaMA 3 Herd of Models. arXiv preprint (2024). https://arxiv.org Most recent LLaMA model family at time of writing

  38. [38]

    Mistral AI. 2024. Mixtral of Experts. arXiv preprint (2024). https://arxiv.org Sparse mixture-of-experts language model

  39. [39]

    OpenAI. 2025. GPT-5: Technical Overview and Capabilities. Model documentation and release notes. https://openai.com Large multimodal foundation model released by OpenAI

  40. [40]

    OpenAI. 2025. OpenAI o3: Reasoning-Focused Large Language Model. Technical documentation and model card. https://openai.com Reasoning-optimized model in the OpenAI o-series

  41. [41]

    Aurora Papotti, Ranindya Paramitha, and Fabio Massacci. 2024. On the acceptance by code reviewers of candidate security patches suggested by Automated Program Repair tools. Empirical Software Engineering 29, 5 (2024), 1–35

  42. [42]

    Aurora Papotti, Katja Tuma, and Fabio Massacci. 2025. On the effects of program slicing for vulnerability detection during code inspection. Empirical Software Engineering 30, 3 (2025), 1–37

  43. [43]

    Mohammad Masudur Rahman, Chanchal K Roy, and Raula G Kula. 2017. Predicting usefulness of code review comments using textual features and developer experience. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR). IEEE, 215–226

  44. [44]

    Zeeshan Rasheed, Muhammad Waseem, Aakash Ahmad, Kai-Kristian Kemell, Wang Xiaofeng, Anh Nguyen Duc, and Pekka Abrahamsson. 2024. Can large language models serve as data analysts? A multi-agent assisted approach for qualitative data analysis. arXiv preprint arXiv:2402.01386 (2024)

  45. [45]

    Per Runeson and Martin Höst. 2009. Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering 14 (2009), 131–164

  46. [46]

    Antonino Sabetta, Serena Elisa Ponta, Rocio Cabrera Lozoya, Michele Bezzi, Tommaso Sacchetti, Matteo Greco, Gergő Balogh, Péter Hegedűs, Rudolf Ferenc, Ranindya Paramitha, et al. 2024. Known vulnerabilities of open source projects: Where are the fixes? IEEE Security & Privacy 22, 2 (2024), 49–59

  47. [47]

    Hope Schroeder, Marianne Aubin Le Quéré, Casey Randazzo, David Mimno, and Sarita Schoenebeck. 2025. Large Language Models in Qualitative Research: Uses, Tensions, and Intentions. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–17

  48. [48]

    Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, HyoJung Han, Sevien Schulhoff, et al. 2024. The Prompt Report: A Systematic Survey of Prompt Engineering Techniques. arXiv preprint arXiv:2406.06608 (2024)

  50. [50]

    Carolyn B. Seaman. 1999. Qualitative methods in empirical studies of software engineering. IEEE Transactions on Software Engineering 25, 4 (1999), 557–572

  51. [51]

    Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. 2025. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534 (2025)

  52. [52]

    Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. 2010. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology 61, 12 (2010), 2544–2558

  54. [54]

    Matthijs J Warrens. 2015. Five ways to look at Cohen's kappa. Journal of Psychology & Psychotherapy 5 (2015)

  55. [55]

    Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, et al. 2024. LiveBench: A challenging, contamination-limited LLM benchmark. arXiv preprint arXiv:2406.19314 (2024)

  56. [56]

    Claes Wohlin, Per Runeson, Martin Höst, Magnus C Ohlsson, Björn Regnell, Anders Wesslén, et al. 2012. Experimentation in software engineering. Vol. 236. Springer

  57. [57]

    xAI. 2024. Grok: A Large Language Model by xAI. Model documentation and release announcement. https://x.ai Latest publicly released Grok model family at time of writing

  58. [58]

    Ziang Xiao, Xingdi Yuan, Q Vera Liao, Rania Abdelghani, and Pierre-Yves Oudeyer. 2023. Supporting qualitative analysis with large language models: Combining codebook with GPT-3 for deductive coding. In Companion Proceedings of the 28th International Conference on Intelligent User Interfaces. 75–78

  60. [60]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  61. [61]

    Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13. Springer, 818–833

  62. [62]

    Yan Zhang and Barbara M Wildemuth. 2009. Qualitative analysis of content. Applications of Social Research Methods to Questions in Information and Library Science 308, 319 (2009), 1–12

  63. [63]

    Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan, and Xiaoming Li. 2011. Comparing Twitter and traditional media using topic models. In Advances in Information Retrieval: 33rd European Conference on IR Research, ECIR 2011, Dublin, Ireland, April 18-21, 2011. Proceedings 33. Springer, 338–349