pith. machine review for the scientific record.

arxiv: 2604.14175 · v1 · submitted 2026-03-26 · 💻 cs.CL · cs.AI

QU-NLP at ArchEHR-QA 2026: Two-Stage QLoRA Fine-Tuning of Qwen3-4B for Patient-Oriented Clinical Question Answering and Evidence Sentence Alignment

Pith reviewed 2026-05-15 00:55 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords clinical question answering · QLoRA fine-tuning · evidence sentence alignment · electronic health records · Qwen3-4B · data scarcity · shared task · retrieval ensemble

The pith

Two-stage QLoRA fine-tuning of Qwen3-4B, first on 30,000 clinical samples and then on 20 development cases, produces an overall score of 32.87 for patient question answering while exposing data scarcity as the core limit.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors build a unified system for generating answers to clinical questions from patient records and for aligning those answers to the exact sentences that support them. They fine-tune the Qwen3-4B model in two stages using QLoRA: first on thirty thousand samples from a medical QA corpus to gain clinical knowledge, then on the twenty available development examples to match the required output format. The resulting system reaches an overall score of 32.87 on the official test split, computed across six automatic metrics. Both subtasks suffer from the same core problem: with so few annotated examples the model cannot reliably tell which sentences in a note are relevant to the question. The work therefore points to data augmentation as the most promising direction for improvement.

Core claim

A two-stage QLoRA fine-tuning procedure applied to Qwen3-4B first establishes clinical domain competence on 30,000 emrQA-MedSQuAD samples and then adapts to the specific answer generation style using only 20 development cases. This produces an overall score of 32.87 on the test-2026 split, with component scores BLEU 9.42, ROUGE-L 27.04, SARI 55.42, BERTScore 43.00, AlignScore 25.28 and MEDCON 37.04. For evidence sentence alignment a weighted ensemble of BM25, TF-IDF cosine similarity and a fine-tuned cross-encoder reaches micro-F1 of 67.16. The experiments indicate that twenty annotated cases are too few to teach the model how to distinguish relevant from irrelevant clinical sentences.
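
To make the recipe concrete, here is a minimal sketch of the two-stage procedure, assuming the Hugging Face transformers, peft, and trl stack. The 4-bit NF4 loading and the two stages come from the paper; the LoRA rank, alpha, learning rates, epoch counts, and the dataset variables emrqa_medsquad_30k and archehr_dev_20 are illustrative assumptions, since the paper does not report its hyperparameters (a point the referee raises below).

    # Minimal sketch of the two-stage QLoRA recipe. NF4 loading and the two
    # stages follow the paper; rank, alpha, learning rates, epoch counts, and
    # the dataset variables are assumptions, not reported values.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from trl import SFTConfig, SFTTrainer

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",               # NF4 quantisation, per the abstract
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B",
                                                 quantization_config=bnb)
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,  # assumed hyperparameters
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    ))

    def train_stage(dataset, lr, epochs, out_dir):
        SFTTrainer(
            model=model,
            train_dataset=dataset,
            args=SFTConfig(output_dir=out_dir, learning_rate=lr,
                           num_train_epochs=epochs,
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8),
        ).train()

    # Stage 1: clinical domain competence (30,000 emrQA-MedSQuAD samples).
    train_stage(emrqa_medsquad_30k, lr=2e-4, epochs=1, out_dir="stage1")
    # Stage 2: task-specific output style (the 20 ArchEHR-QA development cases).
    train_stage(archehr_dev_20, lr=1e-4, epochs=3, out_dir="stage2")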

What carries the argument

Two-stage QLoRA adaptation of Qwen3-4B for clinical QA generation combined with an ensemble retriever for evidence alignment.

If this is right

  • The system generates answers scoring BLEU 9.42, ROUGE-L 27.04, SARI 55.42, BERTScore 43.00, AlignScore 25.28 and MEDCON 37.04 on the unseen test set.
  • Evidence sentence alignment reaches micro-F1 of 67.16 on the 100-case test set using the weighted ensemble of BM25, TF-IDF and cross-encoder; a sketch of one plausible combination rule follows this list.
  • Both answer generation and evidence alignment are limited by the inability to distinguish relevant from irrelevant clinical sentences when only 20 annotated cases are available.
  • Data augmentation is identified as the highest-leverage next step to overcome the data scarcity that affects both subtasks.
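
As flagged in the second bullet, the ensemble admits a direct sketch. The one below scores each note sentence against the gold answer with all three retrievers and thresholds a convex combination of min-max-normalised scores. The three components are named in the paper; the weights, threshold, normalisation, and the cross-encoder checkpoint are placeholders, since the paper describes the combination rule only at a high level.

    # Weighted ensemble of BM25, TF-IDF cosine similarity, and a cross-encoder,
    # scoring each note sentence against a gold answer. Weights, threshold,
    # normalisation, and the cross-encoder checkpoint are assumptions.
    import numpy as np
    from rank_bm25 import BM25Okapi
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from sentence_transformers import CrossEncoder

    def minmax(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    def ensemble_evidence(note_sentences, gold_answer,
                          weights=(0.3, 0.2, 0.5), threshold=0.5):
        # BM25: each note sentence is a document, the gold answer the query.
        bm25 = BM25Okapi([s.lower().split() for s in note_sentences])
        bm25_scores = bm25.get_scores(gold_answer.lower().split())

        # TF-IDF cosine similarity between the answer and each sentence.
        tfidf = TfidfVectorizer().fit(note_sentences + [gold_answer])
        tfidf_scores = cosine_similarity(tfidf.transform([gold_answer]),
                                         tfidf.transform(note_sentences))[0]

        # Cross-encoder relevance (placeholder for the fine-tuned checkpoint).
        ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        ce_scores = ce.predict([(gold_answer, s) for s in note_sentences])

        combined = (weights[0] * minmax(bm25_scores)
                    + weights[1] * minmax(tfidf_scores)
                    + weights[2] * minmax(ce_scores))
        # Indices of sentences predicted as supporting evidence.
        return [i for i, c in enumerate(combined) if c >= threshold]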

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Scaling the number of annotated clinical cases would likely improve the model's ability to select relevant evidence sentences.
  • The two-stage domain-then-task adaptation pattern could be tested on other low-resource medical text tasks.
  • Jointly training generation and alignment objectives in one model might reduce dependence on large annotated sets.

Load-bearing premise

That initial training on 30,000 general clinical QA examples followed by adaptation on only 20 specific cases is enough to create a model that generalizes to new patient questions and correctly identifies supporting sentences.

What would settle it

Training a version of the model without the first-stage clinical pre-adaptation and checking whether it matches or exceeds the reported test scores; or measuring whether adding more than 20 annotated cases improves the separation of relevant and irrelevant sentences.

Figures

Figures reproduced from arXiv: 2604.14175 by Mohammad AL-Smadi.

Figure 1
Figure 1: The inference prompt template. Both the patient question and the clinician-interpreted reformulation are included, allowing the model to target the precise clinical sub-question while maintaining accessible phrasing. Note sentences are rendered with their original numeric identifiers (N: text\n) to preserve the relevance-annotation indexing used in evaluation. The system instruction specifies: profe…
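
A minimal reconstruction of the template the caption describes: the numbered "N: text" rendering, the dual question inputs, the 75-word limit, and the /no_think token come from the paper; the instruction wording itself is an assumption, since the caption is truncated at source.

    # Sketch of the Figure 1 inference prompt. Only the structural elements are
    # documented; the exact system-instruction wording is assumed.
    def build_prompt(patient_question, clinician_reformulation, note_sentences):
        numbered_note = "\n".join(f"{i}: {s}" for i, s in enumerate(note_sentences, 1))
        return (
            "/no_think\n"  # disable Qwen3's chain-of-thought "thinking" mode
            "You are a professional clinical assistant. Answer the patient's "
            "question in at most 75 words, in accessible language, grounded "
            "only in the numbered note sentences below.\n\n"
            f"Patient question: {patient_question}\n"
            f"Clinician-interpreted question: {clinician_reformulation}\n\n"
            f"Clinical note:\n{numbered_note}\n"
        )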
Original abstract

We present a unified system addressing both Subtask 3 (answer generation) and Subtask 4 (evidence sentence alignment) of the ArchEHR-QA Shared Task. For Subtask 3, we apply two-stage Quantised Low-Rank Adaptation (QLoRA) to Qwen3-4B loaded in 4-bit NF4 quantisation: first on 30,000 samples from the emrQA-MedSQuAD corpus to establish clinical domain competence, then on the 20 annotated development cases to learn the task-specific output style. Our system achieves an overall score of 32.87 on the official test-2026 split (BLEU = 9.42, ROUGE-L = 27.04, SARI = 55.42, BERTScore = 43.00, AlignScore = 25.28, MEDCON = 37.04). For Subtask 4, we develop a weighted ensemble of three retrieval methods - BM25 with relative thresholding, TF-IDF cosine similarity, and a fine-tuned cross-encoder - to identify note sentences supporting a given gold answer, achieving a micro-F1 of 67.16 on the 100-case test set. Experiments reveal that both subtasks expose the same fundamental challenge: 20 annotated training cases are insufficient to distinguish relevant from irrelevant clinical sentences, pointing to data augmentation as the highest-leverage future direction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper describes a two-stage QLoRA fine-tuning pipeline for Qwen3-4B (first on 30k emrQA-MedSQuAD samples for clinical domain exposure, then on the 20 ArchEHR-QA development cases for task style) that produces an overall score of 32.87 on the official 2026 test split for Subtask 3 (answer generation), together with a weighted ensemble of BM25, TF-IDF, and cross-encoder retrieval that reaches 67.16 micro-F1 on Subtask 4 (evidence alignment). The authors conclude that the 20 annotated cases are fundamentally insufficient to teach reliable distinction between relevant and irrelevant clinical sentences.

Significance. If the reported test-set numbers hold after verification of the second-stage contribution, the work supplies a concrete, reproducible baseline for low-resource clinical QA that combines large-scale domain adaptation with minimal task-specific tuning. The explicit reporting of six automatic metrics on the held-out official split and the identification of data scarcity as the primary bottleneck are useful for the shared-task community.

major comments (2)
  1. [two-stage QLoRA fine-tuning procedure] The two-stage adaptation procedure (abstract and §3) reports test-2026 scores after fine-tuning on the 20 development cases but provides no ablation that removes the second stage, no learning-curve analysis, and no held-out validation split within the development data. Without these controls it is impossible to determine whether the observed performance stems from the 30k-sample domain exposure alone or from overfitting the 20 examples, directly undermining the claim that the second stage successfully teaches generalizable output style.
  2. [Experiments and conclusion] The conclusion that '20 annotated training cases are insufficient' (abstract and final paragraph) rests on performance observed after adaptation on the single 20-case development set. No error analysis, confusion-matrix breakdown by sentence relevance, or comparison against a first-stage-only baseline is supplied, so the insufficiency claim lacks the quantitative support needed to guide future data-augmentation work.
minor comments (2)
  1. [Method] Hyperparameter values for QLoRA rank, alpha, and the learning-rate schedule used in each stage are not listed; adding them would improve reproducibility.
  2. [Subtask 4 ensemble] The ensemble weighting scheme for Subtask 4 is described only at a high level; the precise combination rule and any threshold values should be stated explicitly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review. We appreciate the feedback on strengthening the experimental validation of our two-stage fine-tuning approach and the claims about data insufficiency. We will revise the manuscript accordingly to address these points.

Point-by-point responses
  1. Referee: [two-stage QLoRA fine-tuning procedure] The two-stage adaptation procedure (abstract and §3) reports test-2026 scores after fine-tuning on the 20 development cases but provides no ablation that removes the second stage, no learning-curve analysis, and no held-out validation split within the development data. Without these controls it is impossible to determine whether the observed performance stems from the 30k-sample domain exposure alone or from overfitting the 20 examples, directly undermining the claim that the second stage successfully teaches generalizable output style.

    Authors: We agree with this assessment. The current manuscript does not include the requested ablations or analyses, which limits the strength of our claims about the second stage. In the revised version, we will add: (1) results from a first-stage-only baseline on the test-2026 split, (2) a learning curve plotting performance metrics against the number of second-stage training examples (using subsets of the 20 cases), and (3) performance metrics using 4-fold cross-validation on the development set to simulate held-out validation (a minimal sketch of this protocol follows these responses). These additions will clarify the contribution of each stage and support the generalizability claim. revision: yes

  2. Referee: [Experiments and conclusion] The conclusion that '20 annotated training cases are insufficient' (abstract and final paragraph) rests on performance observed after adaptation on the single 20-case development set. No error analysis, confusion-matrix breakdown by sentence relevance, or comparison against a first-stage-only baseline is supplied, so the insufficiency claim lacks the quantitative support needed to guide future data-augmentation work.

    Authors: We concur that additional analyses are necessary to robustly support this conclusion. The manuscript currently infers insufficiency from the overall low performance scores, but lacks detailed breakdowns. We will revise by incorporating: an error analysis of the evidence alignment task with examples of misclassified sentences, a confusion matrix or precision/recall breakdown for relevant vs. irrelevant sentences, and explicit comparison to the first-stage-only model on Subtask 4. This will provide the quantitative backing and better guide future work on data augmentation. revision: yes
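
For concreteness, here is a minimal sketch of the cross-validation protocol promised in response 1, assuming scikit-learn's KFold; fine_tune_stage2 and score_fold are hypothetical stand-ins for the authors' second-stage training and metric code.

    # Sketch of 4-fold cross-validation over the 20 development cases.
    # fine_tune_stage2 and score_fold are hypothetical stand-ins.
    import numpy as np
    from sklearn.model_selection import KFold

    def cross_validate(dev_cases, stage1_model, n_splits=4):
        scores = []
        for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True,
                                        random_state=0).split(dev_cases):
            model = fine_tune_stage2(stage1_model,
                                     [dev_cases[i] for i in train_idx])
            scores.append(score_fold(model, [dev_cases[i] for i in val_idx]))
        return float(np.mean(scores)), float(np.std(scores))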

Circularity Check

0 steps flagged

No circularity: empirical results on held-out official test set

Full rationale

The paper reports performance of a two-stage QLoRA fine-tuning procedure and a retrieval ensemble on the official ArchEHR-QA 2026 test split using standard automatic metrics (BLEU, ROUGE-L, etc.). No equations, derivations, or fitted parameters are defined inside the paper that reduce the reported scores to quantities constructed from the training inputs. The 20 development cases are used only for the second-stage adaptation; test metrics are computed externally and independently. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the central claims. The work is therefore validated against external benchmarks rather than its own constructions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the transferability of clinical knowledge from the emrQA-MedSQuAD corpus via QLoRA and on the assumption that the 20 development cases suffice to learn the required output format.

free parameters (1)
  • QLoRA rank, alpha, and learning rates
    Standard LoRA hyperparameters chosen for the two fine-tuning stages but not specified in the abstract.
axioms (1)
  • domain assumption: QLoRA adaptation on a large clinical corpus transfers domain competence to a new task with only 20 additional examples
    Invoked when the authors state that the first stage establishes clinical domain competence before the second stage.

pith-pipeline@v0.9.0 · 5577 in / 1322 out tokens · 40058 ms · 2026-05-15T00:55:01.916096+00:00 · methodology


Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 4 internal anchors

  1. [1]

    Introduction Electronic Health Records (EHRs) contain the authoritative account of a patient’s hospitalisation: diagnosis, treatment decisions, test results, and follow-up plans. Yet this information remains largely inaccessible to patients and caregivers because it is written in specialised medical language and embedded within lengthy structured documents...

  2. [2]

    Research Gap Despite growing interest in patient-oriented clinical QA, several gaps remain underexplored: First, while some prior systems have used open-weight models (Soni et al., 2025), the top-ranked systems in the 2025 shared task often relied on proprietary LLMs combined with few-shot learning or synthetically generated examples (Yoadsanit et al.,...

  3. [3]

    Task The ArchEHR-QA Shared Task (Soni and Demner-Fushman, 2026b) formalises this challenge across two complementary subtasks. Subtask 3 requires producing a factually grounded answer of at most 75 words in professional yet accessible language, given a de-identified MIMIC-III (Johnson et al.,

  4. [4]

    Together, the two subtasks model the full information access pipeline: locate the evidence, then communicate it

    and MIMIC-IV (Johnson et al., 2023) note excerpt, a patient question, and a clinician-interpreted reformulation. Subtask 4 requires identifying which note sentences constitute the supporting evidence for a given gold answer, evaluated via micro-F1 over sentence-level binary citation decisions. Together, the two subtasks model the full information access pipeline: locate t...

  5. [5]

    Subtask 4 is evaluated with standard information-retrieval metrics: micro-precision, micro-recall, and micro-F1 over sentence-level binary citation decisions

    measure n-gram overlap with reference answers but do not directly reward simplification quality or clinical faithfulness, SARI (Xu et al., 2016) specifically targets simplification by separately rewarding addition of plain-language terms, retention of source content, and deletion of unnecessary jargon, BERTScore (Zhang et al., 2020) provides semanti...

  6. [6]

    My doctor performed a cardiac catherization. Was this invasive, risky procedure necessary

    Data 4.1. ArchEHR-QA Dataset The ArchEHR-QA dataset (Soni and Demner-Fushman, 2026a) is derived from de-identified MIMIC-III (Johnson et al., 2016) and MIMIC-IV (Johnson et al., 2023) discharge summaries. Each case comprises four components: a question submitted by a patient or family member through an online portal; a clinician-interpreted reformulat...

  7. [7]

    thinking

    System Description 5.1. Subtask 3: Answer Generation 5.1.1. Base Model We use Qwen3-4B (Qwen Team, 2025), a 4-billion parameter instruction-tuned open-weight language model. Qwen3’s chain-of-thought “thinking” mode is disabled via the /no_think instruction token, preventing the model from prepending lengthy reasoning traces that would consume the 75-word genera...

  8. [8]

    Subtask 3: Evaluation Results Table 5 reports our system’s performance on the official test-2026 split (47 cases, IDs 121–167) alongside the top-ranked system

    Experiments 6.1. Subtask 3: Evaluation Results Table 5 reports our system’s performance on the official test-2026 split (47 cases, IDs 121–167) alongside the top-ranked system. Our system is competitive on surface n-gram metrics: BLEU within 0.50 points and ROUGE-L within 0.81 points of the top system. The largest gaps appear on AlignScore (−6.46) and MEDCO...

  9. [9]

    For Subtask 3, two-stage QLoRA fine-tuning of Qwen3-4B on emrQA-MedSQuAD and 20 annotated development cases achieves an overall score of 32.87 on the test-2026 split

    Conclusion We presented a unified system for the ArchEHR-QA Shared Task addressing both Subtask 3 (answer generation) and Subtask 4 (evidence sentence alignment). For Subtask 3, two-stage QLoRA fine-tuning of Qwen3-4B on emrQA-MedSQuAD and 20 annotated development cases achieves an overall score of 32.87 on the test-2026 split. For Subtask 4, a weighted ensemble of B...

  10. [10]

    References Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient fine-tuning of quantized LLMs. In Advances in Neural Information Processing Systems, volume 36. [arXiv] arXiv:2305.14314. Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon

  11. [11]

    LoRA: Low-Rank Adaptation of Large Language Models

    Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare, 3(1):2:1–2:23. [CrossRef] doi:10.1145/3458754. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. arXiv...

  12. [12]

    In Proceedings of the Third Text REtrieval Conference (TREC-3), NIST Special Publication 500-225, pages 109–126

    Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC-3), NIST Special Publication 500-225, pages 109–126. Sarvesh Soni and Dina Demner-Fushman. 2026a. A dataset for addressing patient’s information needs related to clinical course of hospitalization. Scientific Data. Sarvesh Soni and Dina Demner-Fushman. 2026b. Overview of the arche...

  13. [13]

    BERTScore: Evaluating Text Generation with BERT

    LAMAR at ArchEHR-QA 2025: Clini- cally aligned LLM-generated few-shot learning forEHR-groundedpatientquestionanswering. In Proceedings of the 24th Workshop on Biomedi- cal Language Processing (Shared Tasks), pages 96–103, Vienna, Austria. Association for Compu- tational Linguistics. Yuheng Zha, Yichi Yang, Ruichen Li, and Zhit- ing Hu. 2023. AlignScore: E...