QU-NLP at ArchEHR-QA 2026: Two-Stage QLoRA Fine-Tuning of Qwen3-4B for Patient-Oriented Clinical Question Answering and Evidence Sentence Alignment
Pith reviewed 2026-05-15 00:55 UTC · model grok-4.3
The pith
Two-stage QLoRA fine-tuning of Qwen3-4B, first on 30,000 clinical samples and then on 20 annotated cases, produces an overall score of 32.87 on patient question answering while exposing data scarcity as the core limit.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A two-stage QLoRA fine-tuning procedure applied to Qwen3-4B first establishes clinical domain competence on 30,000 emrQA-MedSQuAD samples and then adapts the model to the task-specific answer style using only 20 development cases. This produces an overall score of 32.87 on the test-2026 split, with component scores of BLEU 9.42, ROUGE-L 27.04, SARI 55.42, BERTScore 43.00, AlignScore 25.28, and MEDCON 37.04. For evidence sentence alignment, a weighted ensemble of BM25, TF-IDF cosine similarity, and a fine-tuned cross-encoder reaches a micro-F1 of 67.16. The experiments indicate that twenty annotated cases are too few to teach the model to distinguish relevant from irrelevant clinical sentences.
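As a concrete reading of this pipeline, the sketch below assumes the Hugging Face transformers, peft, and bitsandbytes stack. The LoRA rank and alpha, learning rates, epoch counts, and the dataset variables emrqa_medsquad_30k and archehr_dev_20 are illustrative placeholders; the paper does not report its hyperparameters.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load Qwen3-4B in 4-bit NF4 quantisation, as described in the abstract.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")  # used to pre-tokenise the datasets
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B",
                                             quantization_config=bnb,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)

# Attach a LoRA adapter; rank/alpha/targets here are illustrative only.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]))

def run_stage(dataset, lr, epochs, out_dir):
    """One fine-tuning stage over a pre-tokenised causal-LM dataset
    (input_ids and labels already present)."""
    args = TrainingArguments(output_dir=out_dir, learning_rate=lr,
                             num_train_epochs=epochs,
                             per_device_train_batch_size=1,
                             gradient_accumulation_steps=8)
    Trainer(model=model, args=args, train_dataset=dataset).train()

# Stage 1: clinical domain competence on 30k emrQA-MedSQuAD samples.
run_stage(emrqa_medsquad_30k, lr=2e-4, epochs=1, out_dir="stage1")
# Stage 2: task-specific output style on the 20 development cases.
run_stage(archehr_dev_20, lr=5e-5, epochs=3, out_dir="stage2")
```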
What carries the argument
Two-stage QLoRA adaptation of Qwen3-4B for clinical QA generation combined with an ensemble retriever for evidence alignment.
If this is right
- The system generates answers scoring BLEU 9.42, ROUGE-L 27.04, SARI 55.42, BERTScore 43.00, AlignScore 25.28 and MEDCON 37.04 on the unseen test set.
- Evidence sentence alignment reaches a micro-F1 of 67.16 on the 100-case test set using the weighted ensemble of BM25, TF-IDF, and cross-encoder (the metric is sketched after this list).
- Both answer generation and evidence alignment are limited by the inability to distinguish relevant from irrelevant clinical sentences when only 20 annotated cases are available.
- Data augmentation is identified as the highest-leverage next step to overcome the data scarcity that affects both subtasks.
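The micro-F1 referenced in the second bullet above pools sentence-level binary citation decisions across all cases before computing precision and recall. A minimal sketch of the standard definition (not the organisers' scoring code):

```python
def micro_f1(gold, pred):
    """Micro-F1 over sentence-level binary citation decisions.

    gold, pred: parallel lists, one set of cited sentence ids per case.
    Counts are pooled across all cases before computing precision/recall.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly cited
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious citations
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed citations
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# e.g. micro_f1([{1, 4}, {2}], [{1}, {2, 3}]) -> 0.667 (P = R = 2/3)
```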
Where Pith is reading between the lines
- Scaling the number of annotated clinical cases would likely improve the model's ability to select relevant evidence sentences.
- The two-stage domain-then-task adaptation pattern could be tested on other low-resource medical text tasks.
- Jointly training generation and alignment objectives in one model might reduce dependence on large annotated sets.
Load-bearing premise
That initial training on 30,000 general clinical QA examples followed by adaptation on only 20 specific cases is enough to create a model that generalizes to new patient questions and correctly identifies supporting sentences.
What would settle it
Training a version of the model without the first-stage clinical pre-adaptation and checking whether it matches or exceeds the reported test scores; or measuring whether adding more than 20 annotated cases improves the separation of relevant and irrelevant sentences.
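Either check reduces to a small experiment loop. A hypothetical sketch, where finetune_stage2 and evaluate_overall_score stand in for the authors' unpublished training and scoring code, and k = 0 corresponds to the first-stage-only ablation:

```python
# Hypothetical learning-curve / ablation loop; finetune_stage2 and
# evaluate_overall_score are placeholders, not the authors' code.
for k in (0, 5, 10, 15, 20):
    # k = 0 skips the second stage entirely (first-stage-only baseline).
    model_k = finetune_stage2(stage1_model, dev_cases[:k])
    score = evaluate_overall_score(model_k, test_2026_cases)
    print(f"second-stage cases: {k:2d}  overall score: {score:.2f}")
```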
Original abstract
We present a unified system addressing both Subtask 3 (answer generation) and Subtask 4 (evidence sentence alignment) of the ArchEHR-QA Shared Task. For Subtask 3, we apply two-stage Quantised Low-Rank Adaptation (QLoRA) to Qwen3-4B loaded in 4-bit NF4 quantisation: first on 30,000 samples from the emrQA-MedSQuAD corpus to establish clinical domain competence, then on the 20 annotated development cases to learn the task-specific output style. Our system achieves an overall score of 32.87 on the official test-2026 split (BLEU = 9.42, ROUGE-L = 27.04, SARI = 55.42, BERTScore = 43.00, AlignScore = 25.28, MEDCON = 37.04). For Subtask 4, we develop a weighted ensemble of three retrieval methods (BM25 with relative thresholding, TF-IDF cosine similarity, and a fine-tuned cross-encoder) to identify note sentences supporting a given gold answer, achieving a micro-F1 of 67.16 on the 100-case test set. Experiments reveal that both subtasks expose the same fundamental challenge: 20 annotated training cases are insufficient to distinguish relevant from irrelevant clinical sentences, pointing to data augmentation as the highest-leverage future direction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper describes a two-stage QLoRA fine-tuning pipeline for Qwen3-4B (first on 30k emrQA-MedSQuAD samples for clinical domain exposure, then on the 20 ArchEHR-QA development cases for task style) that produces an overall score of 32.87 on the official 2026 test split for Subtask 3 (answer generation), together with a weighted ensemble of BM25, TF-IDF, and cross-encoder retrieval that reaches 67.16 micro-F1 on Subtask 4 (evidence alignment). The authors conclude that the 20 annotated cases are fundamentally insufficient to teach reliable distinction between relevant and irrelevant clinical sentences.
Significance. If the reported test-set numbers hold after verification of the second-stage contribution, the work supplies a concrete, reproducible baseline for low-resource clinical QA that combines large-scale domain adaptation with minimal task-specific tuning. The explicit reporting of six automatic metrics on the held-out official split and the identification of data scarcity as the primary bottleneck are useful for the shared-task community.
major comments (2)
- [two-stage QLoRA fine-tuning procedure] The two-stage adaptation procedure (abstract and §3) reports test-2026 scores after fine-tuning on the 20 development cases but provides no ablation that removes the second stage, no learning-curve analysis, and no held-out validation split within the development data. Without these controls it is impossible to determine whether the observed performance stems from the 30k-sample domain exposure alone or from overfitting the 20 examples, directly undermining the claim that the second stage successfully teaches generalizable output style.
- [Experiments and conclusion] The conclusion that '20 annotated training cases are insufficient' (abstract and final paragraph) rests on performance observed after adaptation on the single 20-case development set. No error analysis, confusion-matrix breakdown by sentence relevance, or comparison against a first-stage-only baseline is supplied, so the insufficiency claim lacks the quantitative support needed to guide future data-augmentation work.
minor comments (2)
- [Method] Hyperparameter values for QLoRA rank, alpha, and the learning-rate schedule used in each stage are not listed; adding them would improve reproducibility.
- [Subtask 4 ensemble] The ensemble weighting scheme for Subtask 4 is described only at a high level; the precise combination rule and any threshold values should be stated explicitly.
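For concreteness, the sketch below shows one plausible combination rule consistent with the abstract: min-max normalisation per method, relative thresholding on BM25, and a weighted sum. The weights w, the threshold rel_tau, and the cross-encoder checkpoint are illustrative assumptions, not values from the paper, which fine-tunes its own cross-encoder.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import CrossEncoder

def minmax(x):
    """Min-max normalise so the three score scales are comparable."""
    rng = x.max() - x.min()
    return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

def ensemble_scores(note_sents, gold_answer, w=(0.4, 0.3, 0.3), rel_tau=0.6):
    # BM25 with relative thresholding: zero out scores below a fraction
    # rel_tau of the per-case maximum (one reading of the paper's rule).
    bm25 = np.array(BM25Okapi([s.split() for s in note_sents])
                    .get_scores(gold_answer.split()))
    if bm25.max() > 0:
        bm25[bm25 < rel_tau * bm25.max()] = 0.0

    # TF-IDF cosine similarity between each note sentence and the answer.
    vec = TfidfVectorizer().fit(note_sents + [gold_answer])
    tfidf = cosine_similarity(vec.transform(note_sents),
                              vec.transform([gold_answer])).ravel()

    # Cross-encoder relevance; this public checkpoint is a stand-in.
    ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    ce_scores = np.array(ce.predict([(gold_answer, s) for s in note_sents]))

    # Weighted sum; sentences above a tuned cutoff would be cited as evidence.
    return w[0] * minmax(bm25) + w[1] * minmax(tfidf) + w[2] * minmax(ce_scores)
```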
Simulated Author's Rebuttal
Thank you for the detailed review. We appreciate the feedback on strengthening the experimental validation of our two-stage fine-tuning approach and the claims about data insufficiency. We will revise the manuscript accordingly to address these points.
Point-by-point responses
- Referee: [two-stage QLoRA fine-tuning procedure] The two-stage adaptation procedure (abstract and §3) reports test-2026 scores after fine-tuning on the 20 development cases but provides no ablation that removes the second stage, no learning-curve analysis, and no held-out validation split within the development data. Without these controls it is impossible to determine whether the observed performance stems from the 30k-sample domain exposure alone or from overfitting the 20 examples, directly undermining the claim that the second stage successfully teaches generalizable output style.
Authors: We agree with this assessment. The current manuscript does not include the requested ablations or analyses, which limits the strength of our claims about the second stage. In the revised version, we will add: (1) results from a first-stage-only baseline on the test-2026 split, (2) a learning curve plotting performance metrics against the number of second-stage training examples (using subsets of the 20 cases), and (3) performance metrics using 4-fold cross-validation on the development set to simulate held-out validation. These additions will clarify the contribution of each stage and support the generalizability claim. revision: yes
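The proposed 4-fold cross-validation is straightforward to specify; a sketch reusing the hypothetical finetune_stage2 and evaluate_overall_score helpers from the earlier ablation sketch:

```python
from sklearn.model_selection import KFold

# 4-fold CV over the 20 development cases (15 train / 5 held out per fold),
# as proposed in the rebuttal; training and scoring code are placeholders.
kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fold, (tr_idx, te_idx) in enumerate(kf.split(dev_cases)):
    model_f = finetune_stage2(stage1_model, [dev_cases[i] for i in tr_idx])
    score = evaluate_overall_score(model_f, [dev_cases[i] for i in te_idx])
    print(f"fold {fold}: {score:.2f}")
```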
- Referee: [Experiments and conclusion] The conclusion that '20 annotated training cases are insufficient' (abstract and final paragraph) rests on performance observed after adaptation on the single 20-case development set. No error analysis, confusion-matrix breakdown by sentence relevance, or comparison against a first-stage-only baseline is supplied, so the insufficiency claim lacks the quantitative support needed to guide future data-augmentation work.
Authors: We concur that additional analyses are necessary to robustly support this conclusion. The manuscript currently infers insufficiency from the overall low performance scores, but lacks detailed breakdowns. We will revise by incorporating: an error analysis of the evidence alignment task with examples of misclassified sentences, a confusion matrix or precision/recall breakdown for relevant vs. irrelevant sentences, and explicit comparison to the first-stage-only model on Subtask 4. This will provide the quantitative backing and better guide future work on data augmentation. revision: yes
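The requested relevance breakdown is a standard binary-classification report over note sentences pooled across cases; a minimal scikit-learn sketch, where y_true and y_pred are assumed flat 0/1 arrays of gold and predicted citation labels:

```python
from sklearn.metrics import classification_report, confusion_matrix

# y_true / y_pred: one binary label per note sentence, pooled over cases.
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred,
                            target_names=["irrelevant", "relevant"]))
```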
Circularity Check
No circularity: empirical results on held-out official test set
full rationale
The paper reports performance of a two-stage QLoRA fine-tuning procedure and a retrieval ensemble on the official ArchEHR-QA 2026 test split using standard automatic metrics (BLEU, ROUGE-L, etc.). No equations, derivations, or fitted parameters are defined inside the paper that reduce the reported scores to quantities constructed from the training inputs. The 20 development cases are used only for the second-stage adaptation; test metrics are computed externally and independently. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the central claims. The central claims therefore rest on external benchmarks rather than on quantities constructed inside the paper.
Axiom & Free-Parameter Ledger
free parameters (1)
- QLoRA rank, alpha, and learning rates
axioms (1)
- domain assumption: QLoRA adaptation on a large clinical corpus transfers domain competence to a new task with only 20 additional examples
Reference graph
Works this paper leans on
- [1] Introduction: Electronic Health Records (EHRs) contain the authoritative account of a patient's hospitalisation: diagnosis, treatment decisions, test results, and follow-up plans. Yet this information remains largely inaccessible to patients and caregivers because it is written in specialised medical language and embedded within lengthy structured documents... (arXiv, 2016)
- [2] Research Gap: Despite growing interest in patient-oriented clinical QA, several gaps remain underexplored: First, while some prior systems have used open-weight models (Soni et al., 2025), the top-ranked systems in the 2025 shared task often relied on proprietary LLMs combined with few-shot learning or synthetically generated examples (Yoadsanit et al., ... (2025)
- [3] Task: The ArchEHR-QA Shared Task (Soni and Demner-Fushman, 2026b) formalises this challenge across two complementary subtasks. Subtask 3 requires producing a factually grounded answer of at most 75 words in professional yet accessible language, given a de-identified MIMIC-III (Johnson et al., ...
- [4] ...and MIMIC-IV (Johnson et al., 2023) note excerpt, a patient question, and a clinician-interpreted reformulation. Subtask 4 requires identifying which note sentences constitute the supporting evidence for a given gold answer, evaluated via micro-F1 over sentence-level binary citation decisions. Together, the two subtasks model the full information access pipeline: locate t... (2023)
- [5] ...measure n-gram overlap with reference answers but do not directly reward simplification quality or clinical faithfulness, SARI (Xu et al., 2016) specifically targets simplification by separately rewarding addition of plain-language terms, retention of source content, and deletion of unnecessary jargon, BERTScore (Zhang et al., 2020) provides semanti... (2016)
- [6] "My doctor performed a cardiac catherization. Was this invasive, risky procedure necessary?" Data 4.1. ArchEHR-QA Dataset: The ArchEHR-QA dataset (Soni and Demner-Fushman, 2026a) is derived from de-identified MIMIC-III (Johnson et al., 2016) and MIMIC-IV (Johnson et al., 2023) discharge summaries. Each case comprises four components: a question submitted by a patient or family member through an online portal; a clinician-interpreted reformulat... (2016)
- [7] System Description 5.1. Subtask 3: Answer Generation, 5.1.1. Base Model: We use Qwen3-4B (Qwen Team, 2025), a 4-billion-parameter instruction-tuned open-weight language model. Qwen3's chain-of-thought "thinking" mode is disabled via the /no_think instruction token, preventing the model from prepending lengthy reasoning traces that would consume the 75-word genera... (2025)
- [8] Experiments 6.1. Subtask 3: Evaluation Results: Table 5 reports our system's performance on the official test-2026 split (47 cases, IDs 121–167) alongside the top-ranked system. Our system is competitive on surface n-gram metrics: BLEU within 0.50 points and ROUGE-L within 0.81 points of the top system. The largest gaps appear on AlignScore (−6.46) and MEDCO... (2026)
- [9] Conclusion: We presented a unified system for the ArchEHR-QA Shared Task addressing both Subtask 3 (answer generation) and Subtask 4 (evidence sentence alignment). For Subtask 3, two-stage QLoRA fine-tuning of Qwen3-4B on emrQA-MedSQuAD and 20 annotated development cases achieves an overall score of 32.87 on the test-2026 split. For Subtask 4, a weighted ensemble of B... (2026)
- [10] References: Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient fine-tuning of quantized LLMs. In Advances in Neural Information Processing Systems, volume 36. arXiv:2305.14314. Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon (arXiv, 2023)
- [11] LoRA: Low-Rank Adaptation of Large Language Models
  Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare, 3(1):2:1–2:23. doi:10.1145/3458754. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. arXiv... (arXiv, doi:10.1145/3458754, 2022)
- [12] Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC-3), NIST Special Publication 500-225, pages 109–126. Sarvesh Soni and Dina Demner-Fushman. 2026a. A dataset for addressing patient's information needs related to clinical course of hospitalization. Scientific Data. Sarvesh Soni and Dina Demner-Fushman. 2026b. Overview of the arche...
- [13] BERTScore: Evaluating Text Generation with BERT
  LAMAR at ArchEHR-QA 2025: Clinically aligned LLM-generated few-shot learning for EHR-grounded patient question answering. In Proceedings of the 24th Workshop on Biomedical Language Processing (Shared Tasks), pages 96–103, Vienna, Austria. Association for Computational Linguistics. Yuheng Zha, Yichi Yang, Ruichen Li, and Zhiting Hu. 2023. AlignScore: E... (arXiv, 2025)