CQA-Eval: Designing Reliable Evaluations of Multi-paragraph Clinical QA under Resource Constraints

Federica Bologna; Lucy Lu Wang; Matthew Wilkens; Tiffany Pan; Yue Guo

arxiv: 2510.10415 · v3 · submitted 2025-10-12 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-18 07:32 UTC · model grok-4.3

keywords: clinical question answering, evaluation framework, inter-annotator agreement, fine-grained annotation, multi-paragraph answers, resource constraints, physician evaluation, risk disclosure

The pith

Fine-grained sentence annotations improve agreement on correctness while coarse annotations improve agreement on relevance in clinical QA evaluations, with risk judgments remaining inconsistent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops CQA-Eval to guide reliable assessment of multi-paragraph answers to clinical questions when medical expertise is scarce and full review is expensive. It tests coarse whole-answer judgments against fine-grained sentence-by-sentence judgments on three dimensions using physician annotations of 300 real patient questions answered by both doctors and language models. The work finds that annotation detail affects consistency differently by dimension and that reviewing only a small subset of sentences can match the reliability of full coarse review. A reader would care because inconsistent or costly evaluations slow the safe deployment of clinical AI tools that must handle complex medical information without error or omission.

Core claim

The central claim is that when physicians evaluate multi-paragraph answers to patient questions, fine-grained sentence-level annotation raises inter-annotator agreement on correctness, coarse answer-level annotation raises agreement on relevance, and agreement on whether risks are communicated stays low under either method. In addition, annotating only a small subset of sentences produces reliability comparable to full coarse annotation, which lowers the cost and effort required for evaluation.
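To make "inter-annotator agreement" concrete, here is a minimal sketch of Cohen's kappa for two annotators. This summary does not name the agreement statistic the paper uses, so the choice of kappa is an assumption, and the labels below are hypothetical toy values rather than data from the study.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used the same single label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical answer-level correctness labels (1 = correct, 0 = incorrect).
coarse_a = [1, 1, 0, 1]
coarse_b = [1, 0, 0, 1]
kappa_coarse = cohens_kappa(coarse_a, coarse_b)
```

Running the same function on sentence-level labels and on answer-level labels for the same answers is one way to compare the two granularities dimension by dimension.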

What carries the argument

The CQA-Eval comparison of coarse answer-level versus fine-grained sentence-level physician annotations across correctness, relevance, and communicates-risks dimensions for multi-paragraph clinical answers.

If this is right

Evaluators of clinical QA systems should prefer fine-grained annotation when the priority is measuring factual correctness.
Coarse whole-answer annotation should be used when the priority is measuring relevance to the question.
Assessments of risk communication will require methods beyond changes in annotation granularity because inconsistency persists.
Resource-limited teams can annotate only a small subset of sentences and still obtain reliability close to full coarse review.
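The last recommendation can be sketched as follows. The summary does not say how the paper selects its sentence subset or how sentence labels are combined, so uniform random sampling, the function names, and the "all sampled sentences correct" roll-up rule are all illustrative assumptions.

```python
import random

def sample_sentence_indices(num_sentences, k, seed=0):
    """Pick k sentence positions uniformly at random, returned in document order."""
    k = min(k, num_sentences)  # short answers are annotated in full
    rng = random.Random(seed)  # fixed seed so every annotator sees the same subset
    return sorted(rng.sample(range(num_sentences), k))

def answer_label_from_sentences(sentence_labels, indices):
    """Hypothetical roll-up: the answer counts as correct (1) only if every sampled sentence is."""
    return int(all(sentence_labels[i] == 1 for i in indices))

# Hypothetical sentence-level correctness labels for one six-sentence answer.
sentence_labels = [1, 1, 0, 1, 1, 1]
picked = sample_sentence_indices(len(sentence_labels), k=3, seed=0)
subset_label = answer_label_from_sentences(sentence_labels, picked)
```

Fixing the seed matters here: agreement can only be computed if all annotators judge the same sampled sentences.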

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same granularity trade-offs could be tested in other expert domains such as legal document QA where full review is also costly.
Clinical LLM developers could incorporate these annotation rules into automated evaluation pipelines to reduce reliance on scarce physician time.
Future experiments could check whether the patterns hold when non-physician annotators or hybrid human-AI judgment are used instead.

Load-bearing premise

The 300 real patient questions and the physician annotations collected for them are representative enough to support general recommendations about evaluation reliability for multi-paragraph clinical QA systems.

What would settle it

A new study on a different collection of patient questions or with a new group of physician annotators in which fine-grained annotation fails to raise correctness agreement or small sentence subsets fail to match coarse reliability would falsify the framework's recommendations.

Original abstract

Evaluating multi-paragraph clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise and achieving consistent human judgments over multi-paragraph text is difficult. We introduce CQA-Eval, an evaluation framework and set of evaluation recommendations for limited-resource and high-expertise settings. Based on physician annotations of 300 real patient questions answered by physicians and LLMs, we compare coarse answer-level versus fine-grained sentence-level evaluation over the dimensions of correctness, relevance, and risk disclosure. We find that inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse improves agreement on relevance, and judgments on communicates-risks remain inconsistent. Additionally, annotating only a small subset of sentences can provide reliability comparable to coarse annotations, reducing cost and effort.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CQA-Eval shows that fine-grained annotation lifts IAA on correctness while small sentence subsets can match coarse reliability, but the 300-question sample limits how widely those patterns apply.

The main thing to know is that this paper tests concrete annotation tweaks for evaluating multi-paragraph clinical QA when expert time is tight. Based on physician labels for 300 real patient questions, it reports that fine-grained sentence-level work improves agreement on correctness, coarse answer-level work helps on relevance, and risk communication stays noisy either way. It also claims that labeling just a small subset of sentences delivers reliability close to full coarse annotation.

Referee Report

2 major / 2 minor

Summary. The paper introduces CQA-Eval, a framework and set of recommendations for evaluating multi-paragraph clinical QA systems under resource constraints. Using physician annotations on 300 real patient questions answered by both physicians and LLMs, it compares coarse answer-level versus fine-grained sentence-level evaluation across correctness, relevance, and communicates-risks. The central empirical claims are that fine-grained annotation improves inter-annotator agreement (IAA) on correctness, coarse annotation improves IAA on relevance, risk judgments remain inconsistent, and annotating only a small subset of sentences yields reliability comparable to full coarse annotations.

Significance. If the empirical patterns hold and generalize, the work offers practical value for clinical NLP evaluation by identifying annotation strategies that balance reliability and cost in high-expertise settings. The grounding in real patient questions and the focus on resource-constrained scenarios are strengths that could inform more efficient protocols for assessing LLM-based clinical QA systems.

major comments (2)

[Data collection / annotation study description] The central recommendations for evaluation under resource constraints rest on annotations from a fixed collection of 300 real patient questions. The manuscript does not provide details on the sampling procedure, domain coverage (specialties, question types, answer lengths, risk profiles), or evidence that this set adequately represents the broader space of multi-paragraph clinical queries. This is load-bearing for extrapolating the IAA patterns and subset-comparability results to general guidelines.
[Results] Results section: The abstract states the main IAA findings and subset-comparability claim but supplies no quantitative IAA coefficients, confidence intervals, statistical tests, or error analysis. Without these numbers it is not possible to assess whether the observed differences are reliable or whether the small-subset result truly matches coarse reliability.

minor comments (2)

[Methods] Clarify the exact definition and operationalization of 'small subset of sentences' (e.g., selection method, size relative to full answer) in the methods or results.
[Introduction / Framework description] The term 'CQA-Eval framework' is introduced but its concrete components (guidelines, annotation interface, aggregation rules) are not fully distinguished from the empirical study itself.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below, indicating planned revisions to improve clarity and support for our claims.

Referee: [Data collection / annotation study description] The central recommendations for evaluation under resource constraints rest on annotations from a fixed collection of 300 real patient questions. The manuscript does not provide details on the sampling procedure, domain coverage (specialties, question types, answer lengths, risk profiles), or evidence that this set adequately represents the broader space of multi-paragraph clinical queries. This is load-bearing for extrapolating the IAA patterns and subset-comparability results to general guidelines.

Authors: We agree that additional details on sampling and coverage are needed to support generalizability. In the revised manuscript, we will add a new subsection under Data Collection describing the sampling procedure (including source of the 300 questions and selection criteria), distributions across specialties, question types, answer lengths, and risk profiles, along with a brief discussion of representativeness relative to broader clinical QA corpora and any limitations.
revision: yes
Referee: [Results] Results section: The abstract states the main IAA findings and subset-comparability claim but supplies no quantitative IAA coefficients, confidence intervals, statistical tests, or error analysis. Without these numbers it is not possible to assess whether the observed differences are reliable or whether the small-subset result truly matches coarse reliability.

Authors: We agree that quantitative details are essential. We will revise the abstract to report specific IAA coefficients (e.g., Krippendorff's alpha or Cohen's kappa) for each dimension and annotation granularity, and expand the Results section to include confidence intervals, statistical comparisons of IAA between conditions, and a short error analysis of sources of disagreement. These changes will allow direct assessment of the reliability claims.
revision: yes
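As an illustration of the kind of interval the referee asks for, the sketch below computes a percentile bootstrap confidence interval for Cohen's kappa by resampling items with replacement. Nothing here comes from the paper: the statistic, the resampling scheme, the function names, and the toy labels are all assumptions.

```python
import random
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label lists."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    fa, fb = Counter(a), Counter(b)
    p_e = sum(fa[c] * fb[c] for c in fa) / (n * n)
    # If both annotators used one identical label throughout, treat agreement as perfect.
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)

def bootstrap_kappa_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for kappa, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohens_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper

# Hypothetical labels from two annotators on ten items.
ann_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
ann_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
ci_low, ci_high = bootstrap_kappa_ci(ann_a, ann_b)
```

With only a handful of items the interval will be wide, which is the practical reason for reporting it: a gap in point estimates between coarse and fine-grained agreement means little if the intervals overlap heavily.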

Circularity Check

0 steps flagged

No circularity: empirical annotation study on fresh data

The paper reports inter-annotator agreement results and subset-comparability observations directly from a new collection of physician annotations on 300 real patient questions. These findings are presented as empirical outcomes of the annotation process itself, with no equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations that reduce the claims to prior inputs by construction. The work is self-contained as a resource-constrained evaluation study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on physician annotations serving as a reliable proxy for evaluation quality and on the 300 questions being representative of real clinical use.

axioms (1)

Domain assumption: Physician annotations constitute a trustworthy ground truth for measuring correctness, relevance, and risk disclosure in clinical answers.
The entire comparison of coarse versus fine-grained methods is built on these annotations.

invented entities (1)

CQA-Eval framework (no independent evidence)
Purpose: To organize evaluation recommendations for resource-constrained clinical QA settings.
Newly proposed in the paper with no independent prior existence shown.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We compare coarse answer-level versus fine-grained sentence-level evaluation over the dimensions of correctness, relevance, and safety... annotating only a small subset of sentences can provide reliability comparable to coarse annotations"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "fine-grained annotation improves agreement on correctness, coarse improves agreement on relevance"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Table to Cell: Attention for Better Reasoning with TABALIGN
cs.AI · 2026-05 · unverdicted · novelty 7.0

TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding exec...