Co-Refine: AI-Powered Tool Supporting Qualitative Analysis
Pith reviewed 2026-05-10 02:07 UTC · model grok-4.3
The pith
Co-Refine constrains LLM outputs with deterministic embedding scores to deliver real-time audit signals for qualitative coding consistency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Co-Refine is presented as an AI-augmented qualitative coding platform that uses a three-stage audit pipeline to supply continuous, grounded feedback on coding consistency. Stage 1 calculates deterministic embedding-based metrics for mathematical consistency. Stage 2 grounds LLM verdicts within ±0.15 of those deterministic scores. Stage 3 generates code definitions drawn from previous patterns to deepen the feedback loop. The paper's central claim is that deterministic scoring can effectively constrain LLM outputs to produce reliable, real-time audit signals for qualitative analysis.
What carries the argument
The three-stage audit pipeline that first computes deterministic embedding-based consistency metrics, then grounds LLM verdicts within ±0.15 of those scores, and finally derives code definitions from observed patterns.
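The Stage 1/Stage 2 coupling described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the mean-cosine consistency metric, the treatment of an empty history, and the symmetric ±0.15 clamp are all assumptions made here for concreteness.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def deterministic_consistency(new_vec: np.ndarray, prior_vecs: list) -> float:
    """Stage 1 (assumed form): mean cosine similarity of a new excerpt's
    embedding against prior excerpts tagged with the same code."""
    if not prior_vecs:
        return 1.0  # no coding history yet, so nothing to drift from
    return float(np.mean([cosine(new_vec, v) for v in prior_vecs]))

def ground_verdict(llm_score: float, det_score: float, band: float = 0.15) -> float:
    """Stage 2: clamp the LLM's consistency verdict to within
    ±band of the deterministic Stage 1 score."""
    return min(max(llm_score, det_score - band), det_score + band)
```

Under this sketch, an LLM verdict of 0.9 against a deterministic score of 0.5 would be clamped down to 0.65, which is exactly the anchoring behavior the referee questions below: if the deterministic score is a poor proxy, the clamp drags the verdict toward it.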
Load-bearing premise
That embedding-based metrics accurately reflect the researcher's intended code consistency, and that grounding LLM outputs within ±0.15 of those scores produces reliable feedback.
What would settle it
A controlled comparison in which independent raters evaluate whether coders using Co-Refine show measurably less drift in code application than coders without it, and whether the tool's signals match the coders' actual intended meanings.
Original abstract
Qualitative coding relies on a researcher's application of codes to textual data. As coding proceeds across large datasets, interpretations of codes often shift (temporal drift), reducing the credibility of the analysis. Existing Computer-Assisted Qualitative Data Analysis (CAQDAS) tools provide support for data management but offer no workflow for real-time detection of these drifts. We present Co-Refine, an AI-augmented qualitative coding platform that delivers continuous, grounded feedback on coding consistency without disrupting the researcher's workflow. The system employs a three-stage audit pipeline: Stage 1 computes deterministic embedding-based metrics for mathematical consistency; Stage 2 grounds LLM verdicts within ±0.15 of the deterministic scores; and Stage 3 produces code definitions from previous patterns to create a deepening feedback loop. Co-Refine demonstrates that deterministic scoring can effectively constrain LLM outputs to produce reliable, real-time audit signals for qualitative analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Co-Refine, an AI-augmented platform for qualitative coding that addresses temporal drift in code interpretations via a three-stage audit pipeline. Stage 1 applies deterministic embedding-based metrics (e.g., cosine similarity) to assess mathematical consistency; Stage 2 constrains LLM verdicts to lie within ±0.15 of those scores; Stage 3 derives evolving code definitions from prior patterns to create a feedback loop. The central claim is that this architecture delivers reliable, real-time audit signals for coding consistency without disrupting the researcher's workflow, outperforming existing CAQDAS tools that lack such support.
Significance. If the reliability claims hold, the work would represent a meaningful advance in HCI and qualitative methods by demonstrating a hybrid deterministic-LLM workflow for interpretive tasks. It explicitly credits the constrained use of embeddings to anchor LLM output and the pattern-based deepening loop as mechanisms for grounded feedback. This could serve as a template for other domains requiring consistency in subjective labeling, provided the embedding proxy is validated.
major comments (3)
- [Abstract] Abstract: The assertion that Co-Refine 'demonstrates that deterministic scoring can effectively constrain LLM outputs to produce reliable, real-time audit signals' is unsupported by any evaluation data, user studies, inter-rater agreement metrics, ablation results, or error analysis. This is load-bearing for the central claim because the reliability of the audit signals rests entirely on the untested premise that the Stage 1 metrics align with researcher intent.
- [Stage 2] Stage 2 description: The specific grounding threshold of ±0.15 is introduced without justification, sensitivity analysis, or comparison to alternative bounds. If the deterministic embedding scores systematically diverge from interpretive consistency (as the skeptic concern notes), this fixed interval cannot guarantee reliable LLM feedback and may instead anchor outputs to a misaligned proxy.
- [System Architecture] Overall pipeline (Stages 1 and 3): No evidence is provided that cosine-similarity or other embedding metrics serve as a valid proxy for the context-sensitive, temporally evolving nature of qualitative codes. The manuscript should include at least a pilot comparison against human-coded consistency to test for false negatives/positives before claiming the three-stage loop produces trustworthy signals.
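The pilot comparison the third major comment asks for could be scored with a simple confusion count between the tool's drift flags and independent human judgments. The sketch below is illustrative only: the `drift_flags` helper, its 0.6 cutoff, and the label format are hypothetical, not part of the manuscript.

```python
def drift_flags(det_scores, threshold=0.6):
    """Flag a code application as potential drift when its Stage 1
    consistency score falls below a cutoff (hypothetical value)."""
    return [s < threshold for s in det_scores]

def confusion(flags, human_drift):
    """Compare tool flags against human 'this application drifted' labels,
    yielding the false-positive/false-negative counts the referee requests."""
    tp = sum(f and h for f, h in zip(flags, human_drift))
    fp = sum(f and not h for f, h in zip(flags, human_drift))
    fn = sum((not f) and h for f, h in zip(flags, human_drift))
    tn = sum((not f) and (not h) for f, h in zip(flags, human_drift))
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}
```

Even a small pilot scored this way would reveal whether low embedding similarity systematically tracks what a human coder would call drift, or whether the proxy produces the false negatives/positives the report warns about.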
minor comments (2)
- [Figures] The manuscript would benefit from a figure or diagram explicitly showing data flow across the three stages, including how deterministic scores are computed and passed to the LLM.
- [Related Work] Prior work on temporal drift in qualitative analysis and existing CAQDAS limitations could be cited more explicitly to sharpen the novelty claim.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our manuscript. We agree that several claims require clarification and that design choices need better justification. We have revised the paper accordingly to tone down unsupported assertions, provide rationale for parameters, and explicitly discuss limitations. Our responses to each major comment are provided below.
Point-by-point responses
Referee: [Abstract] The assertion that Co-Refine 'demonstrates that deterministic scoring can effectively constrain LLM outputs to produce reliable, real-time audit signals' is unsupported by any evaluation data, user studies, inter-rater agreement metrics, ablation results, or error analysis. This is load-bearing for the central claim because the reliability of the audit signals rests entirely on the untested premise that the Stage 1 metrics align with researcher intent.
Authors: We accept this point. The manuscript is a system description and does not contain empirical evaluations of reliability. We will revise the abstract to replace 'demonstrates' with 'presents a system designed to' and rephrase the central claim to focus on the architecture rather than proven reliability. A new Limitations section will be added that directly addresses the lack of user studies, inter-rater metrics, and validation of embedding alignment with researcher intent, framing these as necessary future work. revision: yes
Referee: [Stage 2] The specific grounding threshold of ±0.15 is introduced without justification, sensitivity analysis, or comparison to alternative bounds. If the deterministic embedding scores systematically diverge from interpretive consistency, this fixed interval cannot guarantee reliable LLM feedback.
Authors: The ±0.15 value was selected heuristically from internal development tests on sample qualitative datasets to constrain outputs without overly restricting the LLM. We agree this lacks transparency. In the revision we will explain the choice, make the threshold a configurable system parameter, and add an appendix with sensitivity analysis across a range of bounds (e.g., ±0.05 to ±0.25) showing effects on feedback stability and LLM adherence. revision: yes
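The promised sensitivity analysis could be as simple as sweeping the band and measuring how often the grounding step actually alters an LLM verdict. This is a sketch under assumed data, not the authors' appendix: the sample scores and the clamp-rate metric are invented here for illustration.

```python
def clamp(x, lo, hi):
    return min(max(x, lo), hi)

def clamp_rate(llm_scores, det_scores, band):
    """Fraction of LLM verdicts altered by the ±band grounding step.
    A rate near 1.0 means the LLM adds little beyond the deterministic
    score; near 0.0 means the band barely constrains it."""
    clamped = [clamp(l, d - band, d + band)
               for l, d in zip(llm_scores, det_scores)]
    changed = sum(c != l for c, l in zip(clamped, llm_scores))
    return changed / len(llm_scores)

# Hypothetical verdict/score pairs, swept across the bounds the
# authors propose for the appendix (±0.05 to ±0.25).
llm = [0.9, 0.55, 0.2, 0.7]
det = [0.5, 0.5, 0.5, 0.6]
for band in (0.05, 0.10, 0.15, 0.20, 0.25):
    print(f"±{band:.2f}: clamp rate {clamp_rate(llm, det, band):.2f}")
```

Plotting clamp rate (and downstream feedback stability) against the band would make the ±0.15 choice auditable rather than heuristic.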
Referee: [System Architecture] No evidence is provided that cosine-similarity or other embedding metrics serve as a valid proxy for the context-sensitive, temporally evolving nature of qualitative codes. The manuscript should include at least a pilot comparison against human-coded consistency to test for false negatives/positives.
Authors: We acknowledge the absence of direct validation for embeddings as a proxy. The design rationale draws on established NLP literature using embeddings for semantic similarity as an initial signal for potential drift. We will expand the related work and architecture sections to articulate this motivation more clearly and include a brief illustrative walkthrough with example codes and outputs. However, a formal pilot study comparing against human judgments requires new data collection and is outside the scope of the current revision; we will add this explicitly as a limitation and planned follow-up research. revision: partial
Circularity Check
No circularity: system architecture claim with no derivations or self-referential reductions
full rationale
The paper presents a three-stage pipeline (embedding metrics in Stage 1, LLM grounding to ±0.15 in Stage 2, pattern-based definitions in Stage 3) as an AI-augmented tool for detecting coding drift. No equations, parameter fittings, or mathematical derivations are described that reduce any 'prediction' or central result to its own inputs by construction. The ±0.15 threshold is an explicit design parameter, not a fitted value renamed as output. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The work is a descriptive system and workflow claim whose validity rests on external user studies or inter-rater metrics (not supplied here), not on internal circular reduction. This matches the default case of a self-contained non-derivational paper.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: embedding-based metrics can capture coding consistency.
- Ad hoc to this paper: LLM verdicts grounded within ±0.15 of deterministic scores are reliable.
invented entities (1)
- Co-Refine three-stage audit pipeline (no independent evidence)
Reference graph
Works this paper leans on
- [1] Srinivas Billa. 2024. SemEval-2016 ABSA Reviews English Translated Resampled Dataset. https://huggingface.co/datasets/srinivasbilla/semeval-2016-absa-reviews-english-translated-resampled
- [2] Rosanna Cole. 2024. Inter-Rater Reliability Methods in Qualitative Case Study Research. Sociological Methods & Research 53, 4 (2024), 1944–1975. doi:10.1177/00491241231156971
- [3] J. Gao et al. 2023. CoAIcoder: Examining the Effectiveness of AI-assisted Human-to-Human Collaboration in Qualitative Analysis. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. doi:10.1145/3617362
- [5] S. A. Gebreegziabher et al. 2023. PaTAT: Human-AI Collaborative Qualitative Coding with Explainable Interactive Rule Synthesis. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. doi:10.1145/3544548.3581352
- [6] Greg Guest, Kathleen M. MacQueen, and Emily E. Namey. 2012. Applied Thematic Analysis. Sage Publications, Los Angeles.
- [7] S. Kapania et al. 2025. Simulacrum of Stories: Examining Large Language Models as Qualitative Research Participants. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. doi:10.1145/3706598.3713220
- [8] M. S. Lam et al. 2024. Concept Induction: Analyzing Unstructured Text with High-Level Concepts Using LLooM. In Proceedings of the CHI Conference on Human Factors in Computing Systems. doi:10.1145/3613904.3642830
- [10] Tim Rietz and Alexander Maedche. 2021. Cody: An AI-Based System to Semi-Automate Coding for Qualitative Research. In Proceedings of the CHI Conference on Human Factors in Computing Systems. doi:10.1145/3411764.3445591
- [12] Leo Aleksander Siiman, Meeli Rannastu-Avalos, Johanna Pöysä-Tarhonen, Päivi Häkkinen, and Margus Pedaste. 2023. Opportunities and Challenges for AI-Assisted Qualitative Data Analysis: An Example from Collaborative Problem-Solving Discourse Data. In International Conference on Innovative Technologies and Learning. https://api.semanticscholar.org/CorpusID:...
- [13] Emily Tseng, Thomas Ristenpart, and Nicola Dell. 2025. Mitigating Trauma in Qualitative Research Infrastructure: Roles for Machine Assistance and Trauma-Informed Design. Proceedings of the ACM on Human-Computer Interaction (2025). To appear at CSCW 2025.
- [14] Q. Wang, M. Erqsous, K. E. Barner, and M. L. Mauriello. 2025. LATA: A Pilot Study on LLM-Assisted Thematic Analysis of Online Social Network Data Generation Experiences. Proceedings of the ACM on Human-Computer Interaction 9 (2025), 1–28.
- [15] Himanshu Zade, Margaret Drouhard, Bonnie Chinh, Lu Gan, and Cecilia Aragon. 2018. Conceptualizing Disagreement in Qualitative Coding. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. doi:10.1145/3173574.3173733