CQA-Eval: Designing Reliable Evaluations of Multi-paragraph Clinical QA under Resource Constraints
Pith reviewed 2026-05-18 07:32 UTC · model grok-4.3
The pith
Fine-grained sentence annotations improve agreement on correctness while coarse annotations improve agreement on relevance in clinical QA evaluations, with risk judgments remaining inconsistent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that when physicians evaluate multi-paragraph answers to patient questions, fine-grained sentence-level annotation raises inter-annotator agreement on correctness, coarse answer-level annotation raises agreement on relevance, and judgments on whether risks are communicated remain inconsistent under either method. In addition, annotating only a small subset of sentences yields reliability comparable to coarse annotation, which lowers the cost and effort of evaluation.
What carries the argument
The CQA-Eval comparison of coarse answer-level versus fine-grained sentence-level physician annotations across correctness, relevance, and communicates-risks dimensions for multi-paragraph clinical answers.
If this is right
- Evaluators of clinical QA systems should prefer fine-grained annotation when the priority is measuring factual correctness.
- Coarse whole-answer annotation should be used when the priority is measuring relevance to the question.
- Assessments of risk communication will require methods beyond changes in annotation granularity because inconsistency persists.
- Resource-limited teams can annotate only a small subset of sentences and still obtain reliability close to full coarse review.
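To make the agreement comparison concrete, here is a minimal sketch of how inter-annotator agreement could be computed for two annotators. It assumes Cohen's kappa as the coefficient; the abstract does not say which coefficient the paper uses. The function name `cohen_kappa` and the toy labels are illustrative and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


# Toy sentence-level correctness labels (1 = correct, 0 = incorrect).
# These values are invented for illustration only.
sentence_a = [1, 1, 0, 1, 0, 1, 1, 0]
sentence_b = [1, 1, 0, 1, 1, 1, 1, 0]
kappa = cohen_kappa(sentence_a, sentence_b)
print(f"sentence-level kappa: {kappa:.3f}")
```

The same function can be applied to answer-level labels, so that the coarse and fine-grained conditions are compared with one coefficient per dimension.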
Where Pith is reading between the lines
- The same granularity trade-offs could be tested in other expert domains such as legal document QA where full review is also costly.
- Clinical LLM developers could incorporate these annotation rules into automated evaluation pipelines to reduce reliance on scarce physician time.
- Future experiments could check whether the patterns hold when non-physician annotators or hybrid human-AI judgment are used instead.
Load-bearing premise
The 300 real patient questions and the physician annotations collected for them are representative enough to support general recommendations about evaluation reliability for multi-paragraph clinical QA systems.
What would settle it
The framework's recommendations would be falsified by a new study, on a different collection of patient questions or with a new group of physician annotators, in which fine-grained annotation fails to raise correctness agreement or small sentence subsets fail to match coarse reliability.
Original abstract
Evaluating multi-paragraph clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise and achieving consistent human judgments over multi-paragraph text is difficult. We introduce CQA-Eval, an evaluation framework and set of evaluation recommendations for limited-resource and high-expertise settings. Based on physician annotations of 300 real patient questions answered by physicians and LLMs, we compare coarse answer-level versus fine-grained sentence-level evaluation over the dimensions of correctness, relevance, and risk disclosure. We find that inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse improves agreement on relevance, and judgments on communicates-risks remain inconsistent. Additionally, annotating only a small subset of sentences can provide reliability comparable to coarse annotations, reducing cost and effort.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CQA-Eval, a framework and set of recommendations for evaluating multi-paragraph clinical QA systems under resource constraints. Using physician annotations on 300 real patient questions answered by both physicians and LLMs, it compares coarse answer-level versus fine-grained sentence-level evaluation across correctness, relevance, and communicates-risks. The central empirical claims are that fine-grained annotation improves inter-annotator agreement (IAA) on correctness, coarse annotation improves IAA on relevance, risk judgments remain inconsistent, and annotating only a small subset of sentences yields reliability comparable to full coarse annotations.
Significance. If the empirical patterns hold and generalize, the work offers practical value for clinical NLP evaluation by identifying annotation strategies that balance reliability and cost in high-expertise settings. The grounding in real patient questions and the focus on resource-constrained scenarios are strengths that could inform more efficient protocols for assessing LLM-based clinical QA systems.
major comments (2)
- [Data collection / annotation study description] The central recommendations for evaluation under resource constraints rest on annotations from a fixed collection of 300 real patient questions. The manuscript does not provide details on the sampling procedure, domain coverage (specialties, question types, answer lengths, risk profiles), or evidence that this set adequately represents the broader space of multi-paragraph clinical queries. This is load-bearing for extrapolating the IAA patterns and subset-comparability results to general guidelines.
- [Results] Results section: The abstract states the main IAA findings and subset-comparability claim but supplies no quantitative IAA coefficients, confidence intervals, statistical tests, or error analysis. Without these numbers it is not possible to assess whether the observed differences are reliable or whether the small-subset result truly matches coarse reliability.
minor comments (2)
- [Methods] Clarify the exact definition and operationalization of 'small subset of sentences' (e.g., selection method, size relative to full answer) in the methods or results.
- [Introduction / Framework description] The term 'CQA-Eval framework' is introduced but its concrete components (guidelines, annotation interface, aggregation rules) are not fully distinguished from the empirical study itself.
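One way the "small subset of sentences" could be operationalized is uniform random sampling without replacement from each answer. The source does not specify the selection method or subset size, so this is only a sketch under that assumption; `sample_sentences`, `k`, and `seed` are hypothetical names introduced here.

```python
import random


def sample_sentences(sentences, k, seed=0):
    """Pick up to k sentences uniformly at random, kept in document order.

    Returns (index, sentence) pairs so that annotations can be mapped
    back to their position in the full answer.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)  # local generator, so results are reproducible
    k = min(k, len(sentences))
    chosen = sorted(rng.sample(range(len(sentences)), k))
    return [(i, sentences[i]) for i in chosen]


# Toy answer split into sentences (invented for illustration).
answer = ["s0", "s1", "s2", "s3", "s4"]
picked = sample_sentences(answer, 2, seed=1)
print(picked)
```

Fixing the seed per answer keeps the subset identical across annotators, which is needed for their labels to be comparable.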
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below, indicating planned revisions to improve clarity and support for our claims.
Point-by-point responses
- Referee: [Data collection / annotation study description] The central recommendations for evaluation under resource constraints rest on annotations from a fixed collection of 300 real patient questions. The manuscript does not provide details on the sampling procedure, domain coverage (specialties, question types, answer lengths, risk profiles), or evidence that this set adequately represents the broader space of multi-paragraph clinical queries. This is load-bearing for extrapolating the IAA patterns and subset-comparability results to general guidelines.
  Authors: We agree that additional details on sampling and coverage are needed to support generalizability. In the revised manuscript, we will add a new subsection under Data Collection describing the sampling procedure (including source of the 300 questions and selection criteria), distributions across specialties, question types, answer lengths, and risk profiles, along with a brief discussion of representativeness relative to broader clinical QA corpora and any limitations.
  Revision: yes
- Referee: [Results] Results section: The abstract states the main IAA findings and subset-comparability claim but supplies no quantitative IAA coefficients, confidence intervals, statistical tests, or error analysis. Without these numbers it is not possible to assess whether the observed differences are reliable or whether the small-subset result truly matches coarse reliability.
  Authors: We agree that quantitative details are essential. We will revise the abstract to report specific IAA coefficients (e.g., Krippendorff's alpha or Cohen's kappa) for each dimension and annotation granularity, and expand the Results section to include confidence intervals, statistical comparisons of IAA between conditions, and a short error analysis of sources of disagreement. These changes will allow direct assessment of the reliability claims.
  Revision: yes
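The confidence intervals requested above could be obtained in several ways. The following is a minimal sketch of one option, a percentile bootstrap over items, assuming two annotators and Cohen's kappa. Neither the coefficient nor the interval method is confirmed by the source, and `bootstrap_kappa_ci` and its parameters are hypothetical names.

```python
import random
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: a single shared label
    return (observed - expected) / (1.0 - expected)


def bootstrap_kappa_ci(labels_a, labels_b, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for kappa, resampling items with replacement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    rng = random.Random(seed)
    n = len(labels_a)
    stats = []
    for _ in range(n_resamples):
        # Resample item indices so each annotator pair stays aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([labels_a[i] for i in idx],
                                 [labels_b[i] for i in idx]))
    stats.sort()
    tail = (1.0 - level) / 2.0
    lower = stats[int(tail * (n_resamples - 1))]
    upper = stats[int((1.0 - tail) * (n_resamples - 1))]
    return lower, upper


# Toy labels, invented for illustration only.
a = [1, 1, 0, 1, 0, 1, 1, 0]
b = [1, 1, 0, 1, 1, 1, 1, 0]
lower, upper = bootstrap_kappa_ci(a, b)
print(f"95% bootstrap interval for kappa: [{lower:.3f}, {upper:.3f}]")
```

Resampling whole items rather than individual labels preserves the pairing between annotators. With only a handful of items the interval will be wide, which is itself informative about how much a subset-based comparison can support.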
Circularity Check
No circularity: empirical annotation study on fresh data
Full rationale
The paper reports inter-annotator agreement results and subset-comparability observations directly from a new collection of physician annotations on 300 real patient questions. These findings are presented as empirical outcomes of the annotation process itself, with no equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations that reduce the claims to prior inputs by construction. The work is self-contained as a resource-constrained evaluation study.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Physician annotations constitute a trustworthy ground truth for measuring correctness, relevance, and risk disclosure in clinical answers.
invented entities (1)
- CQA-Eval framework: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We compare coarse answer-level versus fine-grained sentence-level evaluation over the dimensions of correctness, relevance, and safety... annotating only a small subset of sentences can provide reliability comparable to coarse annotations"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "fine-grained annotation improves agreement on correctness, coarse improves agreement on relevance"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- From Table to Cell: Attention for Better Reasoning with TABALIGN
  TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding exec...