A Computational Method for Measuring "Open Codes" in Qualitative Analysis
Pith reviewed 2026-05-23 17:58 UTC · model grok-4.3
The pith
Four metrics quantify how much each coder contributes to a merged codebook in inductive qualitative analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An LLM-enriched algorithm can merge individual codebooks, and the merged codebook then serves as a fair reference for measuring each coder's contribution with four metrics: Coverage (how much of the merged set the coder covers), Overlap (shared codes), Novelty (unique codes added), and Divergence (codes that differ in interpretation).
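The metric definitions above can be made concrete with a small set-based sketch. It is an illustration under simplifying assumptions, not the paper's equations, which this summary does not reproduce. The merged codebook is assumed to be a mapping from each merged code to the set of coders who contributed it, and the function name `contribution_metrics`, the coder names, and the example codes are all hypothetical. Divergence is left out because comparing interpretations needs more than set membership.

```python
from typing import Dict, Set


def contribution_metrics(merged: Dict[str, Set[str]], coder: str) -> Dict[str, float]:
    """Score one coder against a merged codebook (simplified, set-based).

    `merged` maps each merged code to the set of coders whose individual
    codebooks contained it. This data shape is an assumption for illustration.
    """
    if not merged:
        raise ValueError("merged codebook is empty")
    # Merged codes this coder contributed to.
    mine = {code for code, coders in merged.items() if coder in coders}
    # Codes this coder shares with at least one other coder.
    shared = {code for code in mine if len(merged[code]) > 1}
    # Codes only this coder contributed.
    unique = mine - shared
    total = len(merged)
    return {
        "coverage": len(mine) / total,
        "overlap": len(shared) / total,
        "novelty": len(unique) / total,
    }


# Hypothetical merged codebook with three coders.
merged = {
    "asking for help": {"human_1", "human_2", "llm"},
    "sharing resources": {"human_1", "llm"},
    "expressing frustration": {"human_2"},
    "off-topic banter": {"llm"},
}
scores = contribution_metrics(merged, "human_1")
print(scores)
```

In this sketch Coverage always equals Overlap plus Novelty, because every covered code is either shared or unique. That is a property of the simplification, not a claim about the paper's metrics.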
What carries the argument
The LLM-enriched merging algorithm, which combines codes from multiple coders while attempting to retain their exploratory inputs, paired with the four metrics that compare each original codebook to the merged version.
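To illustrate what a merge step does, the sketch below groups near-duplicate code labels from several coders and records who contributed to each group. It deliberately swaps the paper's LLM-enriched algorithm for plain string similarity from Python's standard `difflib` module, so it captures only the bookkeeping and none of the semantic judgment an LLM would add. The function name `merge_codebooks`, the 0.8 threshold, and the example codebooks are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Set


def merge_codebooks(
    codebooks: Dict[str, List[str]], threshold: float = 0.8
) -> Dict[str, Set[str]]:
    """Greedily merge code labels whose string similarity meets `threshold`.

    Stand-in only: string similarity replaces the LLM-enriched comparison.
    Returns a mapping from merged label to the set of contributing coders.
    """
    merged: Dict[str, Set[str]] = {}
    for coder, codes in codebooks.items():
        for code in codes:
            label = code.strip().lower()
            # Find the first existing merged label that is similar enough.
            match = next(
                (m for m in merged
                 if SequenceMatcher(None, m, label).ratio() >= threshold),
                None,
            )
            # Reuse the matched label, or start a new group under this label.
            key = match if match is not None else label
            merged.setdefault(key, set()).add(coder)
    return merged


# Hypothetical individual codebooks.
codebooks = {
    "human_1": ["Asking for help", "Sharing resources"],
    "llm": ["asking for help", "Sharing a resource", "Off-topic banter"],
}
merged = merge_codebooks(codebooks)
print(merged)
```

Because the grouping is greedy, the first label seen names each group, so results depend on coder order. A merge that depends on order or on surface wording is one concrete way a merging step can shift the metrics computed from it.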
If this is right
- Different merging algorithms produce different metric values, so the choice of merger must be reported.
- The four metrics stay consistent when the pipeline is repeated or when a new LLM is substituted.
- The metrics flag concrete coding problems such as coders generating too many codes or introducing codes not supported by the data.
Where Pith is reading between the lines
- Teams could use the novelty and divergence scores to decide whether to retain or revise AI-generated codes before final analysis.
- The same measurement approach could be tested on coding tasks outside conversation data, such as interview transcripts or field notes.
- If the metrics prove stable, they might serve as an automated check before researchers finalize a shared codebook.
Load-bearing premise
The merged codebook produced by the LLM algorithm fairly represents all exploratory contributions without the model adding its own systematic bias.
What would settle it
Re-running the full pipeline on the same dataset with a different large language model and obtaining substantially different metric scores for the human coders would show the results depend on the specific model chosen.
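The check described above can be sketched as a comparison of per-coder metric scores from two pipeline runs that differ only in the LLM used. The function name `max_metric_shift`, the tolerance value, and every number below are invented for illustration; none are results from the paper, and what counts as "substantially different" is a judgment the analyst would have to justify.

```python
from typing import Dict

Scores = Dict[str, Dict[str, float]]  # coder -> metric name -> value


def max_metric_shift(run_a: Scores, run_b: Scores) -> Dict[str, float]:
    """Largest absolute per-coder change in each metric between two runs."""
    shifts: Dict[str, float] = {}
    # Compare only coders and metrics present in both runs.
    for coder in run_a.keys() & run_b.keys():
        for metric in run_a[coder].keys() & run_b[coder].keys():
            delta = abs(run_a[coder][metric] - run_b[coder][metric])
            shifts[metric] = max(shifts.get(metric, 0.0), delta)
    return shifts


# Invented scores for two human coders under two different LLMs.
run_model_a = {
    "human_1": {"coverage": 0.5, "novelty": 0.25},
    "human_2": {"coverage": 0.75, "novelty": 0.0},
}
run_model_b = {
    "human_1": {"coverage": 0.5, "novelty": 0.5},
    "human_2": {"coverage": 0.5, "novelty": 0.0},
}

TOLERANCE = 0.1  # hypothetical cutoff for a "substantial" shift
shifts = max_metric_shift(run_model_a, run_model_b)
unstable = sorted(m for m, s in shifts.items() if s > TOLERANCE)
print(shifts, unstable)
```

Any metric listed in `unstable` would indicate that the human coders' scores depend on the model chosen for the merge.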
Original abstract
Qualitative analysis is critical to understanding human datasets in many social science disciplines. A central method in this process is inductive coding, where researchers identify and interpret codes directly from the datasets themselves. Yet, this exploratory approach poses challenges for meeting methodological expectations (such as "depth" and "variation"), especially as researchers increasingly adopt Generative AI (GAI) for support. Ground-truth-based metrics are insufficient because they contradict the exploratory nature of inductive coding, while manual evaluation can be labor-intensive. This paper presents a theory-informed computational method for measuring inductive coding results from humans and GAI. Our method first merges individual codebooks using an LLM-enriched algorithm. It measures each coder's contribution against the merged result using four novel metrics: Coverage, Overlap, Novelty, and Divergence. Through two experiments on a human-coded online conversation dataset, we 1) reveal the merging algorithm's impact on metrics; 2) validate the metrics' stability and robustness across multiple runs and different LLMs; and 3) showcase the metrics' ability to diagnose coding issues, such as excessive or irrelevant (hallucinated) codes. Our work provides a reliable pathway for ensuring methodological rigor in human-AI qualitative analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a computational method for evaluating inductive ('open') coding in qualitative analysis. It merges individual codebooks via an LLM-enriched algorithm and then scores each coder's contribution using four novel metrics (Coverage, Overlap, Novelty, Divergence) computed against the merged reference. Two experiments on a human-coded online conversation dataset are reported to demonstrate the merging algorithm's effects, the metrics' stability across runs and LLMs, and their diagnostic utility for issues such as excessive or hallucinated codes.
Significance. If the metrics can be shown to be stable and free of systematic LLM-induced bias, the approach would supply a scalable, ground-truth-free way to assess exploratory coding quality in human-AI workflows, addressing a recognized methodological gap in social-science qualitative research.
Major comments (2)
- [Abstract / merging algorithm] Abstract and method description of the merging algorithm: all four metrics are defined relative to a single LLM-enriched merged codebook. This construction risks circular dependence; if the merge systematically favors certain phrasings, abstraction levels, or granularities (as LLMs are known to do), then Coverage and Divergence scores will partly reflect alignment with the model's priors rather than intrinsic exploratory quality. The reported stability across multiple LLMs does not demonstrate neutrality of the reference, and no external human-merged reference or inter-rater comparison against the LLM merge is described.
- [Abstract / validation experiments] Validation experiments (Abstract): the claims of stability, robustness, and diagnostic power rest on two experiments whose data details, exact metric equations, exclusion rules, sample sizes, and error analysis are not supplied in the provided description. Without these, it is impossible to evaluate whether the reported stability is load-bearing evidence or an artifact of the chosen dataset and LLM family.
Minor comments (1)
- [Abstract] The abstract expands GAI on first use ("Generative AI (GAI)") but uses the acronym LLM without expansion; defining it on first use would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, clarifying aspects of the method and experiments while committing to revisions that strengthen the manuscript.
- Referee: [Abstract / merging algorithm] Abstract and method description of the merging algorithm: all four metrics are defined relative to a single LLM-enriched merged codebook. This construction risks circular dependence; if the merge systematically favors certain phrasings, abstraction levels, or granularities (as LLMs are known to do), then Coverage and Divergence scores will partly reflect alignment with the model's priors rather than intrinsic exploratory quality. The reported stability across multiple LLMs does not demonstrate neutrality of the reference, and no external human-merged reference or inter-rater comparison against the LLM merge is described.
  Authors: We acknowledge the concern about potential circularity and LLM priors influencing the reference. The LLM-enriched merge is intended to synthesize a collective codebook from individual inductive contributions without relying on external ground truth, consistent with the exploratory goals of open coding. Stability across LLMs and runs is presented as evidence of robustness rather than full neutrality. We agree that direct comparison to a human-merged reference would provide stronger validation of the merge step and will add this analysis (including inter-rater agreement metrics between LLM and human merges) to the revised manuscript.
  Revision: yes
- Referee: [Abstract / validation experiments] Validation experiments (Abstract): the claims of stability, robustness, and diagnostic power rest on two experiments whose data details, exact metric equations, exclusion rules, sample sizes, and error analysis are not supplied in the provided description. Without these, it is impossible to evaluate whether the reported stability is load-bearing evidence or an artifact of the chosen dataset and LLM family.
  Authors: The abstract summarizes the two experiments at a high level. The full manuscript provides the requested details: the online conversation dataset and human-coded codebooks (Section 4), exact equations for Coverage, Overlap, Novelty, and Divergence (Section 3), sample sizes, exclusion rules for code filtering, and error analysis of metric behavior. We will expand the abstract to include brief references to these elements and ensure all equations are explicitly stated for clarity.
  Revision: partial
Circularity Check
No circularity found: the metrics are defined relative to the merged codebook by design, and the described content contains no derivations or self-referential reductions.
The paper describes an LLM-enriched merging algorithm followed by four metrics (Coverage, Overlap, Novelty, Divergence) computed against the merged codebook. This structure is definitional to the proposed method rather than a derivation that reduces to fitted inputs or self-citations. No equations, uniqueness theorems, or load-bearing self-citations appear in the abstract or described content. Validation experiments test stability across LLMs and runs, which is an independent empirical check. The method is self-contained against external benchmarks with no reduction of predictions to inputs by construction.