pith. machine review for the scientific record.

arxiv: 2604.10834 · v1 · submitted 2026-04-12 · 💻 cs.SE · cs.AI

Recognition: unknown

LLMs for Qualitative Data Analysis Fail on Security-specific Comments in Human Experiments

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:05 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords large language models · qualitative data analysis · security analysis · thematic coding · human experiments · annotation · Cohen's Kappa · code vulnerabilities

The pith

LLMs cannot reliably annotate security-specific comments from human code analysis experiments

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can replace human experts when coding free-text comments from experiments in which subjects review vulnerable code snippets. Four leading LLMs were prompted to detect nine security-relevant codes such as mentions of code identifiers, lines of code, and security keywords. Outputs were scored against human annotations with Cohen's Kappa under several prompt styles that included detailed codebooks and examples. Marked gains appeared only with detailed code descriptions, yet these gains varied by code and stayed too low for LLMs to stand in for human annotators. The results indicate that current models lack the contextual understanding required for technical security qualitative tasks.

Core claim

The authors find that large language models produce annotations for security aspects in experiment comments that match human experts only modestly, with clear but non-uniform gains from detailed code descriptions that nevertheless remain insufficient to replace human annotators as measured by Cohen's Kappa.

What carries the argument

Prompt variations that add detailed code descriptions and examples, evaluated by Cohen's Kappa agreement with human annotations on nine security codes extracted from free-text comments.
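
As a concrete illustration of that comparison, the sketch below computes per-code Cohen's Kappa between human and LLM binary annotations. The code names, label vectors, and use of scikit-learn are illustrative assumptions, not the paper's actual data or pipeline.

```python
# Illustrative sketch only: per-code agreement between human and LLM annotations.
# The code names and label arrays below are hypothetical, not the paper's data.
from sklearn.metrics import cohen_kappa_score

# One binary label per comment: 1 = code present, 0 = absent.
human = {
    "mentions_code_identifier": [1, 0, 1, 1, 0, 0, 1, 0],
    "mentions_line_of_code":    [0, 0, 1, 0, 0, 1, 0, 0],
    "uses_security_keyword":    [1, 1, 0, 1, 0, 0, 1, 1],
}
llm = {
    "mentions_code_identifier": [1, 0, 0, 1, 0, 1, 1, 0],
    "mentions_line_of_code":    [0, 1, 1, 0, 0, 1, 0, 0],
    "uses_security_keyword":    [1, 1, 0, 0, 0, 0, 1, 0],
}

# Chance-corrected agreement, reported separately for each code, as in the paper.
for code in human:
    kappa = cohen_kappa_score(human[code], llm[code])
    print(f"{code}: kappa = {kappa:.2f}")
```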

Load-bearing premise

The nine security-relevant codes and the chosen human annotations form a sufficient test of LLM capability for qualitative security analysis tasks in general.

What would settle it

An experiment in which one or more LLMs achieve high and uniform Cohen's Kappa scores with humans across all nine codes under comparable prompting conditions.
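
A minimal reading of that settling condition, with an assumed numeric threshold for "high" agreement (the paper does not fix a cutoff):

```python
# Illustrative uniformity check: every one of the nine codes must clear the bar,
# not just the average. The 0.8 threshold is an assumption, not from the paper.
def settles_it(per_code_kappa: dict[str, float], threshold: float = 0.8) -> bool:
    return len(per_code_kappa) == 9 and min(per_code_kappa.values()) >= threshold
```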

Figures

Figures reproduced from arXiv: 2604.10834 by Fabio Massacci, Maria Camporese, Yuanjun Gong.

Figure 1: Annotation by human annotators and LLMs. (a) The annotators get an initial understanding from the background information and the codebook. The annotators individually mark a large fraction of their comments. (b) At this point, a formal definition of the codebook is agreed (columns 1 and 2 of Table 1). (2) Completed annotation with codebook and first review: (a) The annotators get a deeper understanding of t…
Figure 3: The defined prompt P3+ based on P3_pilot. Pilot Experiment: For the exploration of the initial design of the prompts, we employed OpenAI's GPT-4o model [18] (interactive version) to conduct the pilot study for prompt refinement. We chose GPT-4o since this model is not used in the main experiments and can therefore reduce bias. The output of each turn from the pilot LLM is manually reviewed for insig…
Figure 2: The pilot interaction where P3_pilot is constructed on other comments. Few-shot examples are considered a significant help for the LLMs, but they are not considered a good practice for human annotators, as they can be a source of bias and lack of independence. P3_pilot: The prompt asks for conflicting examples and provides clarifications from the researcher. P3_pilot asks the pilot LLM to go through the huma…
Figure 4: Average Cohen's κ (chance-corrected accuracy if humans were ground truth) vs. effort (size of prompts). (a) GPT-5 (b) Claude-4 (c) DeepSeek-V3.2 (d) Qwen3-Max.
Figure 5: Chance-corrected accuracy against the required effort in tokens, with each curve corresponding to a different code.
read the original abstract

[Background:] Thematic analysis of free-text justifications in human experiments provides significant qualitative insights. Yet, it is costly because reliable annotations require multiple domain experts. Large language models (LLMs) seem ideal candidates to replace human annotators. [Problem:] Coding security-specific aspects (code identifiers mentioned, lines-of-code mentioned, security keywords mentioned) may require deeper contextual understanding than sentiment classification. [Objective:] Explore whether LLMs can act as automated annotators for technical security comments by human subjects. [Method:] We prompt four top-performing LLMs on LiveBench to detect nine security-relevant codes in free-text comments by human subjects analyzing vulnerable code snippets. Outputs are compared to human annotators using Cohen's Kappa (chance-corrected accuracy). We test different prompts mimicking annotation best practices, including emerging codes, detailed codebooks with examples, and conflicting examples. [Negative Results:] We observed marked improvements only when using detailed code descriptions; however, these improvements are not uniform across codes and are insufficient to reliably replace a human annotator. [Limitations:] Additional studies with more LLMs and annotation tasks are needed.
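
The paper's prompt texts are not reproduced on this page; the sketch below only illustrates the general shape of a detailed-codebook prompt of the kind the abstract describes. The code names, descriptions, and output convention are hypothetical.

```python
# Hypothetical sketch of a detailed-codebook prompt; wording and codes are
# placeholders, not the paper's actual P1-P3+ prompts.
codebook = {
    "mentions_code_identifier": "The comment names a variable, function, or class from the snippet.",
    "mentions_line_of_code": "The comment refers to a specific line or range of lines.",
    "uses_security_keyword": "The comment uses a security term such as overflow, injection, or sanitization.",
}

def build_prompt(comment: str) -> str:
    lines = ["You annotate free-text comments written by humans reviewing vulnerable code."]
    lines.append("For each code below, answer 1 if it applies to the comment and 0 otherwise.")
    for name, description in codebook.items():
        lines.append(f"- {name}: {description}")
    lines.append(f'Comment: "{comment}"')
    lines.append("Answer with one `code_name: 0/1` pair per line.")
    return "\n".join(lines)

print(build_prompt("The strcpy on line 12 can overflow buf."))
```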

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript examines whether LLMs can serve as automated annotators for nine security-relevant codes in free-text comments from human subjects analyzing vulnerable code snippets. It evaluates four top LLMs using Cohen's Kappa against human annotations, testing prompting strategies that include detailed codebooks, examples, and conflicting cases. The central finding is that detailed code descriptions yield marked but non-uniform improvements, which remain insufficient to reliably replace human annotators.

Significance. If the results hold, the work supplies concrete empirical evidence on the limits of LLMs for domain-specific qualitative coding tasks in security research, where contextual understanding of code identifiers, lines, and keywords is required. The study is strengthened by its use of multiple LLMs, systematic variation of prompts that mimic annotation best practices, and direct chance-corrected comparison to independent human labels.

major comments (2)
  1. The claim that improvements 'are insufficient to reliably replace a human annotator' (abstract and conclusion) is load-bearing yet unanchored. No human inter-annotator agreement figures (e.g., pairwise Cohen's Kappa among the human coders) are reported, nor is any explicit threshold or parity criterion defined for 'reliable' replacement. Without this benchmark, the sufficiency judgment cannot be evaluated quantitatively.
  2. Method section: the number of free-text comments, number of human subjects, and total annotation instances are not stated in the provided abstract or method summary. These quantities are required to assess the statistical power of the reported Kappa values and the robustness of the negative result across codes.
minor comments (2)
  1. Abstract: the four LLMs are described only as 'top-performing on LiveBench'; naming them and providing at least one full prompt template would improve reproducibility.
  2. The limitations paragraph could explicitly note the absence of a human-human baseline as a direction for follow-up work.

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of our work. We address each major comment below.

read point-by-point responses
  1. Referee: The claim that improvements 'are insufficient to reliably replace a human annotator' (abstract and conclusion) is load-bearing yet unanchored. No human inter-annotator agreement figures (e.g., pairwise Cohen's Kappa among the human coders) are reported, nor is any explicit threshold or parity criterion defined for 'reliable' replacement. Without this benchmark, the sufficiency judgment cannot be evaluated quantitatively.

    Authors: We agree that the claim requires better anchoring with an explicit criterion. In the revision we will define 'reliable replacement' using standard interpretations of Cohen's Kappa (substantial agreement at Kappa ≥ 0.61; see the interpretation sketch after these responses). We will also add a limitations paragraph stating that human inter-annotator agreement was not computed, as the study used expert annotations without pairwise independent coding. These changes will be made in the abstract, conclusion, and a new limitations subsection. revision: yes

  2. Referee: Method section: the number of free-text comments, number of human subjects, and total annotation instances are not stated in the provided abstract or method summary. These quantities are required to assess the statistical power of the reported Kappa values and the robustness of the negative result across codes.

    Authors: We thank the referee for noting this clarity issue. The full Method section already contains these quantities. In the revised manuscript we will add an explicit summary paragraph at the start of the Method section stating the number of human subjects, free-text comments, and total annotation instances to facilitate evaluation of statistical power and robustness. revision: yes
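
For context on the criterion proposed in response 1, the sketch below maps a Kappa value to the conventional Landis and Koch interpretation bands; the band labels and the 0.61 cutoff are standard conventions, not values reported in the paper.

```python
# Conventional Landis-Koch bands for Cohen's kappa; 0.61 is the start of the
# "substantial agreement" band that the rebuttal proposes as its replacement criterion.
def interpret_kappa(kappa: float) -> str:
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

def reliable_replacement(kappa: float, threshold: float = 0.61) -> bool:
    return kappa >= threshold
```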

standing simulated objections not resolved
  • Human inter-annotator agreement figures (pairwise Cohen's Kappa) are not available, as the original study design did not include multiple independent annotations per comment for IAA calculation.

Circularity Check

0 steps flagged

No circularity: pure empirical comparison to independent human annotations

full rationale

The paper conducts an empirical study: it prompts four LLMs on security comment data, applies different prompt variants (codebooks, examples), and measures agreement with separate human annotations via Cohen's Kappa. No equations, fitted parameters, derived predictions, or self-referential derivations appear. The central claim rests on observed performance differences against an external human benchmark, not on any reduction to the paper's own inputs or prior self-citations. This matches the default case of a self-contained empirical evaluation with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on treating human annotations as ground truth and Cohen's Kappa as the right agreement metric; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Human annotations provide the correct ground truth labels for security codes.
    Standard assumption in annotation studies where experts define the reference labels.
  • standard math Cohen's Kappa is an appropriate chance-corrected measure of agreement for this task.
    Widely accepted statistical tool for inter-rater reliability in qualitative coding; its definition is recalled below.
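
For completeness, the standard definition behind the second axiom, where p_o is the observed agreement and p_e is the agreement expected by chance:

```latex
% Cohen's kappa: observed agreement corrected for chance agreement.
\kappa = \frac{p_o - p_e}{1 - p_e}
```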

pith-pipeline@v0.9.0 · 5491 in / 1174 out tokens · 112896 ms · 2026-05-10T15:05:00.759229+00:00 · methodology


Reference graph

Works this paper leans on

63 extracted references · 11 canonical work pages · 6 internal anchors

  1. [1]

    Toufique Ahmed, Amiangshu Bosu, Anindya Iqbal, and Shahram Rahimi. 2017. SentiCR: A customized sentiment analysis tool for code review interactions. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 106–111

  2. [2]

    Anthropic. 2025. Claude 4: Claude Opus 4 & Claude Sonnet 4 Large Language Models. System card / technical documentation. https://www.anthropic.com/news/claude-4 Includes Opus 4 and Sonnet 4, released May 22, 2025

  3. [3]

    Andreas Armborst. 2017. Thematic proximity in content analysis. Sage Open 7, 2 (2017), 2158244017707797

  4. [4]

    Anthony G Barnston. 1992. Correspondence among the correlation, RMSE, and Heidke forecast verification measures; refinement of the Heidke score. Weather and Forecasting 7, 4 (1992), 699–709

  5. [5]

    Avik Bhattacharjee. 2025. Introduction to Amazon Bedrock. In A Practical Guide to Generative AI Using Amazon Bedrock: Building, Deploying, and Securing Generative AI Applications. Springer, 79–112

  6. [6]

    David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022

  7. [7]

    Fabio Calefato, Filippo Lanubile, Federico Maiorano, and Nicole Novielli. 2018. Sentiment polarity detection for software development. In Proceedings of the 40th International Conference on Software Engineering. 128–128

  8. [8]

    Fabio Calefato, Filippo Lanubile, and Nicole Novielli. 2017. EmoTxt: a toolkit for emotion recognition from text. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW). IEEE, 79–80

  9. [9]

    Victoria Clarke and Virginia Braun. 2017. Thematic analysis. The Journal of Positive Psychology 12, 3 (2017), 297–298

  10. [10]

    Guilherme A Maldonado da Cruz, Elisa Hatsue Moriya Huzita, and Valéria D Feltrim. 2016. Estimating Trust in Virtual Teams - A Framework based on Sentiment Analysis. In International Conference on Enterprise Information Systems, Vol. 2. SciTePress, 464–471

  11. [11]

    Shih-Chieh Dai, Aiping Xiong, and Lun-Wei Ku. 2023. LLM-in-the-loop: Leveraging large language model for thematic analysis. arXiv preprint arXiv:2310.15100 (2023)

  12. [12]

    DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Ha...

  13. [13]

    Zackary Okun Dunivin. 2024. Scalable qualitative coding with LLMs: Chain-of-thought reasoning matches human performance in some hermeneutic tasks. arXiv preprint arXiv:2401.15170 (2024)

  14. [14]

    Denae Ford, Justin Smith, Philip J Guo, and Chris Parnin. 2016. Paradise unplugged: Identifying barriers for female participation on Stack Overflow. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. 846–857

  15. [15]

    Robert Wayne Gregory, Mark Keil, Jan Muntermann, and Magnus Mähring. 2015. Paradoxes and the nature of ambidexterity in IT transformation programs. Information Systems Research 26, 1 (2015), 57–80

  17. [17]

    Greg Guest, Kathleen M MacQueen, and Emily E Namey. 2011. Applied thematic analysis. Sage Publications

  18. [18]

    Wenbo Guo, Dongliang Mu, Jun Xu, Purui Su, G. Wang, and Xinyu Xing. 2018. LEMNA: Explaining Deep Learning based Security Applications. Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (2018)

  19. [19]

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. GPT-4o system card. arXiv preprint arXiv:2410.21276 (2024)

  20. [20]

    Md Rakibul Islam and Minhaz F Zibran. 2018. DEVA: sensing emotions in the valence arousal space in software engineering text. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing. 1536–1543

  21. [21]

    Nishant Jha and Anas Mahmoud. 2018. Using frame semantics for classifying and summarizing application store reviews. Empirical Software Engineering 23, 6 (2018), 3734–3767

  22. [22]

    Nishant Jha and Anas Mahmoud. 2019. Mining non-functional requirements from app store reviews. Empirical Software Engineering 24 (2019), 3659–3695

  23. [23]

    Jing Jiang, David Lo, Jiahuan He, Xin Xia, Pavneet Singh Kochhar, and Li Zhang. 2017. Why and how developers fork what from whom in GitHub. Empirical Software Engineering 22, 1 (2017), 547–578

  25. [25]

    Brittany Johnson, Rahul Pandita, Justin Smith, Denae Ford, Sarah Elder, Emerson Murphy-Hill, Sarah Heckman, and Caitlin Sadowski. 2016. A cross-tool communication study on program analysis tool notifications. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. 73–84

  26. [26]

    Robbert Jongeling, Proshanta Sarkar, Subhajit Datta, and Alexander Serebrenik. 2017. On negative results when using sentiment analysis tools for software engineering research. Empirical Software Engineering 22 (2017), 2543–2584

  28. [28]

    Rafael Kallis, Andrea Di Sorbo, Gerardo Canfora, and Sebastiano Panichella. 2019. Ticket tagger: Machine learning driven issue classification. In 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 406–409

  30. [30]

    Kevin Danis Li, Adrian M Fernandez, Rachel Schwartz, Natalie Rios, Marvin Nathaniel Carlisle, Gregory M Amend, Hiren V Patel, and Benjamin N Breyer. 2024. Comparing GPT-4 and human researchers in health care data analysis: qualitative description study. Journal of Medical Internet Research 26 (2024), e56500

  32. [32]

    Zhen Li, Deqing Zou, Shouhuai Xu, Xinyu Ou, Hai Jin, Sujuan Wang, Zhijun Deng, and Yuyi Zhong. 2018. VulDeePecker: A deep learning-based system for vulnerability detection. arXiv preprint arXiv:1801.01681 (2018)

  33. [33]

    Bin Lin, Nathan Cassee, Alexander Serebrenik, Gabriele Bavota, Nicole Novielli, and Michele Lanza. 2022. Opinion mining for software development: a systematic literature review. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 3 (2022), 1–41

  34. [34]

    Petter Mæhlum, David Samuel, Rebecka Maria Norman, Elma Jelin, Øyvind Andresen Bjertnæs, Lilja Øvrelid, and Erik Velldal. 2024. It's Difficult to be Neutral – Human and LLM-based Sentiment Annotation of Patient Comments. arXiv preprint arXiv:2404.18832 (2024)

  35. [35]

    Mika V Mäntylä and Casper Lassenius. 2008. What types of defects are really discovered in code reviews? IEEE Transactions on Software Engineering 35, 3 (2008), 430–448

  36. [36]

    Ggaliwango Marvin, Nakayiza Hellen, Daudi Jjingo, and Joyce Nakatumba-Nabende. 2023. Prompt engineering in large language models. In International Conference on Data Intelligence and Cognitive Informatics. Springer, 387–402

  37. [37]

    Meta AI. 2024. The LLaMA 3 Herd of Models. arXiv preprint (2024). https://arxiv.org Most recent LLaMA model family at time of writing

  38. [38]

    Mistral AI. 2024. Mixtral of Experts. arXiv preprint (2024). https://arxiv.org Sparse mixture-of-experts language model

  39. [39]

    OpenAI. 2025. GPT-5: Technical Overview and Capabilities. Model documentation and release notes. https://openai.com Large multimodal foundation model released by OpenAI

  40. [40]

    OpenAI. 2025. OpenAI o3: Reasoning-Focused Large Language Model. Technical documentation and model card. https://openai.com Reasoning-optimized model in the OpenAI o-series

  41. [41]

    Aurora Papotti, Ranindya Paramitha, and Fabio Massacci. 2024. On the acceptance by code reviewers of candidate security patches suggested by Automated Program Repair tools. Empirical Software Engineering 29, 5 (2024), 1–35

  42. [42]

    Aurora Papotti, Katja Tuma, and Fabio Massacci. 2025. On the effects of program slicing for vulnerability detection during code inspection. Empirical Software Engineering 30, 3 (2025), 1–37

  43. [43]

    Mohammad Masudur Rahman, Chanchal K Roy, and Raula G Kula. 2017. Predicting usefulness of code review comments using textual features and developer experience. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR). IEEE, 215–226

  44. [44]

    Zeeshan Rasheed, Muhammad Waseem, Aakash Ahmad, Kai-Kristian Kemell, Wang Xiaofeng, Anh Nguyen Duc, and Pekka Abrahamsson. 2024. Can large language models serve as data analysts? A multi-agent assisted approach for qualitative data analysis. arXiv preprint arXiv:2402.01386 (2024)

  45. [45]

    Per Runeson and Martin Höst. 2009. Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering 14 (2009), 131–164

  46. [46]

    Antonino Sabetta, Serena Elisa Ponta, Rocio Cabrera Lozoya, Michele Bezzi, Tommaso Sacchetti, Matteo Greco, Gergő Balogh, Péter Hegedűs, Rudolf Ferenc, Ranindya Paramitha, et al. 2024. Known vulnerabilities of open source projects: Where are the fixes? IEEE Security & Privacy 22, 2 (2024), 49–59

  47. [47]

    Hope Schroeder, Marianne Aubin Le Quéré, Casey Randazzo, David Mimno, and Sarita Schoenebeck. 2025. Large Language Models in Qualitative Research: Uses, Tensions, and Intentions. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–17

  48. [48]

    Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, HyoJung Han, Sevien Schulhoff, et al. 2024. The Prompt Report: A Systematic Survey of Prompt Engineering Techniques. arXiv preprint arXiv:2406.06608 (2024)

  50. [50]

    Carolyn B. Seaman. 1999. Qualitative methods in empirical studies of software engineering. IEEE Transactions on Software Engineering 25, 4 (1999), 557–572

  51. [51]

    Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. 2025. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534 (2025)

  52. [52]

    Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. 2010. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology 61, 12 (2010), 2544–2558

  54. [54]

    Matthijs J Warrens. 2015. Five ways to look at Cohen's kappa. Journal of Psychology & Psychotherapy 5 (2015)

  55. [55]

    Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, et al. 2024. LiveBench: A challenging, contamination-limited LLM benchmark. arXiv preprint arXiv:2406.19314 (2024)

  56. [56]

    Claes Wohlin, Per Runeson, Martin Höst, Magnus C Ohlsson, Björn Regnell, Anders Wesslén, et al. 2012. Experimentation in software engineering. Vol. 236. Springer

  57. [57]

    xAI. 2024. Grok: A Large Language Model by xAI. Model documentation and release announcement. https://x.ai Latest publicly released Grok model family at time of writing

  58. [58]

    Ziang Xiao, Xingdi Yuan, Q Vera Liao, Rania Abdelghani, and Pierre-Yves Oudeyer. 2023. Supporting qualitative analysis with large language models: Combining codebook with GPT-3 for deductive coding. In Companion Proceedings of the 28th International Conference on Intelligent User Interfaces. 75–78

  60. [60]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  61. [61]

    Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13. Springer, 818–833

  62. [62]

    Yan Zhang and Barbara M Wildemuth. 2009. Qualitative analysis of content. Applications of Social Research Methods to Questions in Information and Library Science 308, 319 (2009), 1–12

  63. [63]

    Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan, and Xiaoming Li. 2011. Comparing Twitter and traditional media using topic models. In Advances in Information Retrieval: 33rd European Conference on IR Research, ECIR 2011, Dublin, Ireland, April 18-21, 2011. Proceedings 33. Springer, 338–349