
arxiv: 2510.10415 · v3 · submitted 2025-10-12 · 💻 cs.CL · cs.AI

CQA-Eval: Designing Reliable Evaluations of Multi-paragraph Clinical QA under Resource Constraints

Pith reviewed 2026-05-18 07:32 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords clinical question answering · evaluation framework · inter-annotator agreement · fine-grained annotation · multi-paragraph answers · resource constraints · physician evaluation · risk disclosure

The pith

Fine-grained sentence annotations improve agreement on correctness while coarse annotations improve agreement on relevance in clinical QA evaluations, with risk judgments remaining inconsistent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops CQA-Eval to guide reliable assessment of multi-paragraph answers to clinical questions when medical expertise is scarce and full review is expensive. It tests coarse whole-answer judgments against fine-grained sentence-by-sentence judgments on three dimensions (correctness, relevance, and risk communication), using physician annotations of answers to 300 real patient questions written by both doctors and language models. The work finds that annotation granularity affects consistency differently for each dimension, and that reviewing only a small subset of sentences can match the reliability of full coarse review. A reader would care because inconsistent or costly evaluations slow the safe deployment of clinical AI tools that must handle complex medical information without error or omission.

Core claim

The central claim is that when physicians evaluate multi-paragraph answers to patient questions, fine-grained sentence-level annotation raises inter-annotator agreement on correctness, coarse answer-level annotation raises agreement on relevance, and agreement on whether risks are communicated stays low under either method. In addition, annotating only a small subset of sentences produces reliability comparable to full coarse annotation, which lowers the cost and effort required for evaluation.

What carries the argument

The CQA-Eval comparison of coarse answer-level versus fine-grained sentence-level physician annotations across correctness, relevance, and communicates-risks dimensions for multi-paragraph clinical answers.
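To make "agreement" concrete, here is a minimal sketch of one common chance-corrected coefficient, Cohen's kappa, for two annotators. The page does not say which coefficient the paper reports, so the choice of kappa, the function name `cohen_kappa`, and the toy labels are all assumptions for illustration, not the authors' method or data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels (1 = correct, 0 = incorrect); not data from the paper.
coarse_a = [1, 1, 0, 1]
coarse_b = [1, 0, 0, 1]
kappa_coarse = cohen_kappa(coarse_a, coarse_b)  # 0.5 for these toy labels
```

The same function could be applied once to answer-level labels and once to sentence-level labels for a given dimension, which is the kind of coarse versus fine-grained comparison described above.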

If this is right

  • Evaluators of clinical QA systems should prefer fine-grained annotation when the priority is measuring factual correctness.
  • Coarse whole-answer annotation should be used when the priority is measuring relevance to the question.
  • Assessments of risk communication will require methods beyond changes in annotation granularity because inconsistency persists.
  • Resource-limited teams can annotate only a small subset of sentences and still obtain reliability close to full coarse review.
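The last recommendation can be sketched as follows. The page does not state how the paper selects the sentence subset or how sentence labels are rolled up to an answer-level judgment (the referee report raises this gap), so the uniform random sampling, the "all sampled sentences must be correct" rule, and both function names are assumptions made purely for illustration.

```python
import random


def sample_sentence_indices(num_sentences, k, seed=0):
    """Pick up to k sentence positions uniformly at random (assumed policy)."""
    rng = random.Random(seed)
    k = min(k, num_sentences)
    return sorted(rng.sample(range(num_sentences), k))


def answer_label_from_subset(sentence_labels, indices):
    """Assumed aggregation rule: the answer counts as correct (1) only if
    every sampled sentence was judged correct, otherwise 0."""
    return int(all(sentence_labels[i] == 1 for i in indices))


# Hypothetical sentence-level correctness labels for one six-sentence answer.
sentence_labels = [1, 1, 0, 1, 1, 1]
picked = sample_sentence_indices(len(sentence_labels), k=3, seed=42)
subset_label = answer_label_from_subset(sentence_labels, picked)
```

Under this sketch the annotator reads only the `picked` sentences, so the cost per answer is bounded by `k` regardless of answer length. Whether a stricter or looser aggregation rule is appropriate depends on the dimension being judged.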

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same granularity trade-offs could be tested in other expert domains such as legal document QA where full review is also costly.
  • Clinical LLM developers could incorporate these annotation rules into automated evaluation pipelines to reduce reliance on scarce physician time.
  • Future experiments could check whether the patterns hold when non-physician annotators or hybrid human-AI judgment are used instead.

Load-bearing premise

The 300 real patient questions and the physician annotations collected for them are representative enough to support general recommendations about evaluation reliability for multi-paragraph clinical QA systems.

What would settle it

A new study on a different collection of patient questions or with a new group of physician annotators in which fine-grained annotation fails to raise correctness agreement or small sentence subsets fail to match coarse reliability would falsify the framework's recommendations.

Original abstract

Evaluating multi-paragraph clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise and achieving consistent human judgments over multi-paragraph text is difficult. We introduce CQA-Eval, an evaluation framework and set of evaluation recommendations for limited-resource and high-expertise settings. Based on physician annotations of 300 real patient questions answered by physicians and LLMs, we compare coarse answer-level versus fine-grained sentence-level evaluation over the dimensions of correctness, relevance, and risk disclosure. We find that inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse improves agreement on relevance, and judgments on communicates-risks remain inconsistent. Additionally, annotating only a small subset of sentences can provide reliability comparable to coarse annotations, reducing cost and effort.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CQA-Eval, a framework and set of recommendations for evaluating multi-paragraph clinical QA systems under resource constraints. Using physician annotations on 300 real patient questions answered by both physicians and LLMs, it compares coarse answer-level versus fine-grained sentence-level evaluation across correctness, relevance, and communicates-risks. The central empirical claims are that fine-grained annotation improves inter-annotator agreement (IAA) on correctness, coarse annotation improves IAA on relevance, risk judgments remain inconsistent, and annotating only a small subset of sentences yields reliability comparable to full coarse annotations.

Significance. If the empirical patterns hold and generalize, the work offers practical value for clinical NLP evaluation by identifying annotation strategies that balance reliability and cost in high-expertise settings. The grounding in real patient questions and the focus on resource-constrained scenarios are strengths that could inform more efficient protocols for assessing LLM-based clinical QA systems.

major comments (2)
  1. [Data collection / annotation study description] The central recommendations for evaluation under resource constraints rest on annotations from a fixed collection of 300 real patient questions. The manuscript does not provide details on the sampling procedure, domain coverage (specialties, question types, answer lengths, risk profiles), or evidence that this set adequately represents the broader space of multi-paragraph clinical queries. This is load-bearing for extrapolating the IAA patterns and subset-comparability results to general guidelines.
  2. [Results] The abstract states the main IAA findings and subset-comparability claim but supplies no quantitative IAA coefficients, confidence intervals, statistical tests, or error analysis. Without these numbers it is not possible to assess whether the observed differences are reliable or whether the small-subset result truly matches coarse reliability.
minor comments (2)
  1. [Methods] Clarify the exact definition and operationalization of 'small subset of sentences' (e.g., selection method, size relative to full answer) in the methods or results.
  2. [Introduction / Framework description] The term 'CQA-Eval framework' is introduced but its concrete components (guidelines, annotation interface, aggregation rules) are not fully distinguished from the empirical study itself.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below, indicating planned revisions to improve clarity and support for our claims.

  1. Referee: [Data collection / annotation study description] The central recommendations for evaluation under resource constraints rest on annotations from a fixed collection of 300 real patient questions. The manuscript does not provide details on the sampling procedure, domain coverage (specialties, question types, answer lengths, risk profiles), or evidence that this set adequately represents the broader space of multi-paragraph clinical queries. This is load-bearing for extrapolating the IAA patterns and subset-comparability results to general guidelines.

    Authors: We agree that additional details on sampling and coverage are needed to support generalizability. In the revised manuscript, we will add a new subsection under Data Collection describing the sampling procedure (including source of the 300 questions and selection criteria), distributions across specialties, question types, answer lengths, and risk profiles, along with a brief discussion of representativeness relative to broader clinical QA corpora and any limitations. revision: yes

  2. Referee: [Results] The abstract states the main IAA findings and subset-comparability claim but supplies no quantitative IAA coefficients, confidence intervals, statistical tests, or error analysis. Without these numbers it is not possible to assess whether the observed differences are reliable or whether the small-subset result truly matches coarse reliability.

    Authors: We agree that quantitative details are essential. We will revise the abstract to report specific IAA coefficients (e.g., Krippendorff's alpha or Cohen's kappa) for each dimension and annotation granularity, and expand the Results section to include confidence intervals, statistical comparisons of IAA between conditions, and a short error analysis of sources of disagreement. These changes will allow direct assessment of the reliability claims. revision: yes
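The confidence intervals requested in this exchange could be obtained with a percentile bootstrap over annotated items. The sketch below is an assumption about how such an interval might be computed, not a procedure taken from the paper. The names `bootstrap_ci` and `percent_agreement` and the toy label pairs are hypothetical, and raw percent agreement stands in for a chance-corrected coefficient only to keep the example short.

```python
import random


def percent_agreement(pairs):
    """Fraction of items on which the two annotators gave the same label."""
    return sum(a == b for a, b in pairs) / len(pairs)


def bootstrap_ci(pairs, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for an agreement statistic.

    Items (pairs of labels) are resampled with replacement; the statistic is
    recomputed on each resample and the empirical quantiles are returned.
    """
    rng = random.Random(seed)
    n = len(pairs)
    stats = sorted(
        statistic([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical (annotator A, annotator B) labels for ten items.
pairs = [(1, 1)] * 8 + [(1, 0)] * 2
low, high = bootstrap_ci(pairs, percent_agreement)
```

Any statistic that maps a list of label pairs to a number can be passed in place of `percent_agreement`, including a kappa or alpha implementation. Comparing the intervals for coarse and fine-grained conditions is one simple way to judge whether an observed difference in agreement is likely to be stable.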

Circularity Check

0 steps flagged

No circularity: empirical annotation study on fresh data


The paper reports inter-annotator agreement results and subset-comparability observations directly from a new collection of physician annotations on 300 real patient questions. These findings are presented as empirical outcomes of the annotation process itself, with no equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations that reduce the claims to prior inputs by construction. The work is self-contained as a resource-constrained evaluation study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on physician annotations serving as a reliable proxy for evaluation quality and on the 300 questions being representative of real clinical use.

axioms (1)
  • domain assumption: Physician annotations constitute a trustworthy ground truth for measuring correctness, relevance, and risk disclosure in clinical answers.
    The entire comparison of coarse versus fine-grained methods is built on these annotations.
invented entities (1)
  • CQA-Eval framework (no independent evidence)
    purpose: To organize evaluation recommendations for resource-constrained clinical QA settings
    Newly proposed in the paper with no independent prior existence shown.




Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Table to Cell: Attention for Better Reasoning with TABALIGN

    cs.AI 2026-05 unverdicted novelty 7.0

    TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding exec...