Co-Refine: AI-Powered Tool Supporting Qualitative Analysis
Pith reviewed 2026-05-10 02:07 UTC · model grok-4.3
The pith
Co-Refine constrains LLM outputs with deterministic embedding scores to deliver real-time audit signals for qualitative coding consistency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Co-Refine is presented as an AI-augmented qualitative coding platform that uses a three-stage audit pipeline to supply continuous, grounded feedback on coding consistency. Stage 1 calculates deterministic embedding-based metrics for mathematical consistency. Stage 2 grounds LLM verdicts within ±0.15 of those deterministic scores. Stage 3 generates code definitions drawn from previous patterns to deepen the feedback loop. The paper's central claim is that deterministic scoring can effectively constrain LLM outputs to produce reliable, real-time audit signals for qualitative analysis.
What carries the argument
The three-stage audit pipeline that first computes deterministic embedding-based consistency metrics, then grounds LLM verdicts within ±0.15 of those scores, and finally derives code definitions from observed patterns.
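The Stage 1/Stage 2 coupling described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the mean-cosine consistency metric, the treatment of an empty history, and the symmetric ±0.15 clamp are all assumptions made here for concreteness.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def deterministic_consistency(new_vec: np.ndarray, prior_vecs: list) -> float:
    """Stage 1 (assumed form): mean cosine similarity of a new excerpt's
    embedding against prior excerpts tagged with the same code."""
    if not prior_vecs:
        return 1.0  # no coding history yet, so nothing to drift from
    return float(np.mean([cosine(new_vec, v) for v in prior_vecs]))

def ground_verdict(llm_score: float, det_score: float, band: float = 0.15) -> float:
    """Stage 2: clamp the LLM's consistency verdict to within
    ±band of the deterministic Stage 1 score."""
    return min(max(llm_score, det_score - band), det_score + band)
```

Under this sketch, an LLM verdict of 0.9 against a deterministic score of 0.5 would be clamped down to 0.65, which is exactly the anchoring behavior the referee questions below: if the deterministic score is a poor proxy, the clamp drags the verdict toward it.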
Load-bearing premise
That embedding-based metrics accurately reflect the researcher's intended code consistency, and that grounding LLM outputs within ±0.15 of those scores produces reliable feedback.
What would settle it
A controlled comparison in which independent raters evaluate whether coders using Co-Refine show measurably less drift in code application than coders without it, and whether the tool's signals match the coders' actual intended meanings.
Original abstract
Qualitative coding relies on a researcher's application of codes to textual data. As coding proceeds across large datasets, interpretations of codes often shift (temporal drift), reducing the credibility of the analysis. Existing Computer-Assisted Qualitative Data Analysis (CAQDAS) tools provide support for data management but offer no workflow for real-time detection of these drifts. We present Co-Refine, an AI-augmented qualitative coding platform that delivers continuous, grounded feedback on coding consistency without disrupting the researcher's workflow. The system employs a three-stage audit pipeline: Stage 1 computes deterministic embedding-based metrics for mathematical consistency; Stage 2 grounds LLM verdicts within ±0.15 of the deterministic scores; and Stage 3 produces code definitions from previous patterns to create a deepening feedback loop. Co-Refine demonstrates that deterministic scoring can effectively constrain LLM outputs to produce reliable, real-time audit signals for qualitative analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Co-Refine, an AI-augmented platform for qualitative coding that addresses temporal drift in code interpretations via a three-stage audit pipeline. Stage 1 applies deterministic embedding-based metrics (e.g., cosine similarity) to assess mathematical consistency; Stage 2 constrains LLM verdicts to lie within ±0.15 of those scores; Stage 3 derives evolving code definitions from prior patterns to create a feedback loop. The central claim is that this architecture delivers reliable, real-time audit signals for coding consistency without disrupting the researcher's workflow, outperforming existing CAQDAS tools that lack such support.
Significance. If the reliability claims hold, the work would represent a meaningful advance in HCI and qualitative methods by demonstrating a hybrid deterministic-LLM workflow for interpretive tasks. It explicitly credits the constrained use of embeddings to anchor LLM output and the pattern-based deepening loop as mechanisms for grounded feedback. This could serve as a template for other domains requiring consistency in subjective labeling, provided the embedding proxy is validated.
major comments (3)
- [Abstract] Abstract: The assertion that Co-Refine 'demonstrates that deterministic scoring can effectively constrain LLM outputs to produce reliable, real-time audit signals' is unsupported by any evaluation data, user studies, inter-rater agreement metrics, ablation results, or error analysis. This is load-bearing for the central claim because the reliability of the audit signals rests entirely on the untested premise that the Stage 1 metrics align with researcher intent.
- [Stage 2] Stage 2 description: The specific grounding threshold of ±0.15 is introduced without justification, sensitivity analysis, or comparison to alternative bounds. If the deterministic embedding scores systematically diverge from interpretive consistency (as the skeptic concern notes), this fixed interval cannot guarantee reliable LLM feedback and may instead anchor outputs to a misaligned proxy.
- [System Architecture] Overall pipeline (Stages 1 and 3): No evidence is provided that cosine-similarity or other embedding metrics serve as a valid proxy for the context-sensitive, temporally evolving nature of qualitative codes. The manuscript should include at least a pilot comparison against human-coded consistency to test for false negatives/positives before claiming the three-stage loop produces trustworthy signals.
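The pilot comparison the third major comment asks for could be scored with a simple confusion count between the tool's drift flags and independent human judgments. The sketch below is illustrative only: the `drift_flags` helper, its 0.6 cutoff, and the label format are hypothetical, not part of the manuscript.

```python
def drift_flags(det_scores, threshold=0.6):
    """Flag a code application as potential drift when its Stage 1
    consistency score falls below a cutoff (hypothetical value)."""
    return [s < threshold for s in det_scores]

def confusion(flags, human_drift):
    """Compare tool flags against human 'this application drifted' labels,
    yielding the false-positive/false-negative counts the referee requests."""
    tp = sum(f and h for f, h in zip(flags, human_drift))
    fp = sum(f and not h for f, h in zip(flags, human_drift))
    fn = sum((not f) and h for f, h in zip(flags, human_drift))
    tn = sum((not f) and (not h) for f, h in zip(flags, human_drift))
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}
```

Even a small pilot scored this way would reveal whether low embedding similarity systematically tracks what a human coder would call drift, or whether the proxy produces the false negatives/positives the report warns about.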
minor comments (2)
- [Figures] The manuscript would benefit from a figure or diagram explicitly showing data flow across the three stages, including how deterministic scores are computed and passed to the LLM.
- [Related Work] Prior work on temporal drift in qualitative analysis and existing CAQDAS limitations could be cited more explicitly to sharpen the novelty claim.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our manuscript. We agree that several claims require clarification and that design choices need better justification. We have revised the paper accordingly to tone down unsupported assertions, provide rationale for parameters, and explicitly discuss limitations. Our responses to each major comment are provided below.
Point-by-point responses
Referee: [Abstract] The assertion that Co-Refine 'demonstrates that deterministic scoring can effectively constrain LLM outputs to produce reliable, real-time audit signals' is unsupported by any evaluation data, user studies, inter-rater agreement metrics, ablation results, or error analysis. This is load-bearing for the central claim because the reliability of the audit signals rests entirely on the untested premise that the Stage 1 metrics align with researcher intent.
Authors: We accept this point. The manuscript is a system description and does not contain empirical evaluations of reliability. We will revise the abstract to replace 'demonstrates' with 'presents a system designed to' and rephrase the central claim to focus on the architecture rather than proven reliability. A new Limitations section will be added that directly addresses the lack of user studies, inter-rater metrics, and validation of embedding alignment with researcher intent, framing these as necessary future work. revision: yes
Referee: [Stage 2] The specific grounding threshold of ±0.15 is introduced without justification, sensitivity analysis, or comparison to alternative bounds. If the deterministic embedding scores systematically diverge from interpretive consistency, this fixed interval cannot guarantee reliable LLM feedback.
Authors: The ±0.15 value was selected heuristically from internal development tests on sample qualitative datasets to constrain outputs without overly restricting the LLM. We agree this lacks transparency. In the revision we will explain the choice, make the threshold a configurable system parameter, and add an appendix with sensitivity analysis across a range of bounds (e.g., ±0.05 to ±0.25) showing effects on feedback stability and LLM adherence. revision: yes
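The promised sensitivity analysis could be as simple as sweeping the band and measuring how often the grounding step actually alters an LLM verdict. This is a sketch under assumed data, not the authors' appendix: the sample scores and the clamp-rate metric are invented here for illustration.

```python
def clamp(x, lo, hi):
    return min(max(x, lo), hi)

def clamp_rate(llm_scores, det_scores, band):
    """Fraction of LLM verdicts altered by the ±band grounding step.
    A rate near 1.0 means the LLM adds little beyond the deterministic
    score; near 0.0 means the band barely constrains it."""
    clamped = [clamp(l, d - band, d + band)
               for l, d in zip(llm_scores, det_scores)]
    changed = sum(c != l for c, l in zip(clamped, llm_scores))
    return changed / len(llm_scores)

# Hypothetical verdict/score pairs, swept across the bounds the
# authors propose for the appendix (±0.05 to ±0.25).
llm = [0.9, 0.55, 0.2, 0.7]
det = [0.5, 0.5, 0.5, 0.6]
for band in (0.05, 0.10, 0.15, 0.20, 0.25):
    print(f"±{band:.2f}: clamp rate {clamp_rate(llm, det, band):.2f}")
```

Plotting clamp rate (and downstream feedback stability) against the band would make the ±0.15 choice auditable rather than heuristic.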
Referee: [System Architecture] No evidence is provided that cosine-similarity or other embedding metrics serve as a valid proxy for the context-sensitive, temporally evolving nature of qualitative codes. The manuscript should include at least a pilot comparison against human-coded consistency to test for false negatives/positives.
Authors: We acknowledge the absence of direct validation for embeddings as a proxy. The design rationale draws on established NLP literature using embeddings for semantic similarity as an initial signal for potential drift. We will expand the related work and architecture sections to articulate this motivation more clearly and include a brief illustrative walkthrough with example codes and outputs. However, a formal pilot study comparing against human judgments requires new data collection and is outside the scope of the current revision; we will add this explicitly as a limitation and planned follow-up research. revision: partial
Circularity Check
No circularity: system architecture claim with no derivations or self-referential reductions
full rationale
The paper presents a three-stage pipeline (embedding metrics in Stage 1, LLM grounding to ±0.15 in Stage 2, pattern-based definitions in Stage 3) as an AI-augmented tool for detecting coding drift. No equations, parameter fittings, or mathematical derivations are described that reduce any 'prediction' or central result to its own inputs by construction. The ±0.15 threshold is an explicit design parameter, not a fitted value renamed as output. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The work is a descriptive system and workflow claim whose validity rests on external user studies or inter-rater metrics (not supplied here), not on internal circular reduction. This matches the default case of a self-contained non-derivational paper.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: embedding-based metrics can capture coding consistency.
- Ad hoc to this paper: LLM verdicts grounded within ±0.15 of deterministic scores are reliable.
invented entities (1)
- Co-Refine three-stage audit pipeline (no independent evidence)
Reference graph
Works this paper leans on
- [1] Srinivas Billa. 2024. SemEval-2016 ABSA Reviews English Translated Resampled Dataset. https://huggingface.co/datasets/srinivasbilla/semeval-2016-absa-reviews-english-translated-resampled
- [2] Rosanna Cole. 2024. Inter-Rater Reliability Methods in Qualitative Case Study Research. Sociological Methods & Research 53, 4 (2024), 1944–1975. doi:10.1177/00491241231156971
- [3] J. Gao et al. 2023. CoAIcoder: Examining the Effectiveness of AI-assisted Human-to-Human Collaboration in Qualitative Analysis. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. doi:10.1145/3617362
- [5] S. A. Gebreegziabher et al. 2023. PaTAT: Human-AI Collaborative Qualitative Coding with Explainable Interactive Rule Synthesis. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. doi:10.1145/3544548.3581352
- [6] Greg Guest, Kathleen M. MacQueen, and Emily E. Namey. 2012. Applied Thematic Analysis. Sage Publications, Los Angeles.
- [7] S. Kapania et al. 2025. Simulacrum of Stories: Examining Large Language Models as Qualitative Research Participants. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. doi:10.1145/3706598.3713220
- [8] M. S. Lam et al. 2024. Concept Induction: Analyzing Unstructured Text with High-Level Concepts Using LLooM. In Proceedings of the CHI Conference on Human Factors in Computing Systems. doi:10.1145/3613904.3642830
- [10] Tim Rietz and Alexander Maedche. 2021. Cody: An AI-Based System to Semi-Automate Coding for Qualitative Research. In Proceedings of the CHI Conference on Human Factors in Computing Systems. doi:10.1145/3411764.3445591
- [12] Leo Aleksander Siiman, Meeli Rannastu-Avalos, Johanna Pöysä-Tarhonen, Päivi Häkkinen, and Margus Pedaste. 2023. Opportunities and Challenges for AI-Assisted Qualitative Data Analysis: An Example from Collaborative Problem-Solving Discourse Data. In International Conference on Innovative Technologies and Learning. https://api.semanticscholar.org/CorpusID:...
- [13] Emily Tseng, Thomas Ristenpart, and Nicola Dell. 2025. Mitigating Trauma in Qualitative Research Infrastructure: Roles for Machine Assistance and Trauma-Informed Design. Proceedings of the ACM on Human-Computer Interaction (2025). To appear at CSCW 2025.
- [14] Q. Wang, M. Erqsous, K. E. Barner, and M. L. Mauriello. 2025. LATA: A Pilot Study on LLM-Assisted Thematic Analysis of Online Social Network Data Generation Experiences. Proceedings of the ACM on Human-Computer Interaction 9 (2025), 1–28.
- [15] Himanshu Zade, Margaret Drouhard, Bonnie Chinh, Lu Gan, and Cecilia Aragon. 2018. Conceptualizing Disagreement in Qualitative Coding. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. doi:10.1145/3173574.3173733