Pith · machine review for the scientific record

arxiv: 2605.03618 · v1 · submitted 2026-05-05 · 💻 cs.CL

Recognition: unknown

BIT.UA-AAUBS at ArchEHR-QA 2026: Evaluating Open-Source and Proprietary LLMs via Prompting in Low-Resource QA

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:40 UTC · model grok-4.3

classification 💻 cs.CL
keywords clinical question answering · prompt engineering · large language models · open-source models · low-resource settings · evidence grounding · shared task

The pith

Domain-adapted open-source LLMs can match proprietary models in clinical question answering through optimized prompt engineering in low-resource settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates several proprietary and open-source LLMs for the ArchEHR-QA 2026 shared task on clinical QA and evidence grounding under strict privacy constraints that prevent training. Using only prompt engineering techniques such as Chain-of-Thought and in-context learning, the team demonstrates that proprietary models are resilient to prompt variations while domain-adapted open-source models like MedGemma 3 27B reach highly competitive performance with the right prompt design. Their prompt-based system secured first place in evidence citation alignment and third place in patient-friendly answer generation. This highlights the viability of prompting approaches for healthcare applications where data access is limited.
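The prompting recipe described here (in-context examples plus a Chain-of-Thought instruction, with no weight updates) can be sketched as follows. The template wording, field names, and the `build_prompt` helper are illustrative assumptions for exposition, not the authors' actual prompts, which are published in their GitHub repository.

```python
# Illustrative sketch of a few-shot Chain-of-Thought prompt for clinical QA
# with evidence grounding. All wording here is an assumption; the authors'
# real prompt templates are in their repository.

def build_prompt(note_excerpt: str, patient_question: str,
                 examples: list[tuple[str, str, str]]) -> str:
    """Assemble a prompt from (note, question, answer) demonstrations."""
    parts = [
        "You are a clinical assistant. Answer the patient's question using "
        "only the numbered note sentences. Think step by step, then give a "
        "patient-friendly answer citing the sentences you used.\n"
    ]
    for note, question, answer in examples:  # in-context demonstrations
        parts.append(f"Note:\n{note}\nQuestion: {question}\nAnswer: {answer}\n")
    # The unanswered target case goes last, ending at "Answer:" for the model
    # to complete.
    parts.append(f"Note:\n{note_excerpt}\nQuestion: {patient_question}\nAnswer:")
    return "\n".join(parts)
```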

Core claim

Without access to training data or the ability to update model weights due to privacy regulations, carefully designed prompts applied to domain-adapted open-source LLMs enable performance that rivals or approaches that of proprietary models in clinical question answering tasks, as shown by leading results in key subtasks of the shared task.

What carries the argument

Prompt engineering strategies including task decomposition, Chain-of-Thought, in-context learning, majority voting, and LLM-as-a-judge ensembling, applied without any fine-tuning or weight updates.
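Of these, the voting step is mechanically simple. Below is a minimal sketch of majority voting over per-case evidence citations, assuming each model's prediction is a set of sentence IDs; the set representation and strict-majority threshold are assumptions, and the LLM-as-a-judge stage (a model adjudicating among candidate answers) is not shown.

```python
from collections import Counter

def majority_vote(predictions: list[set[str]]) -> set[str]:
    """Keep each evidence-sentence ID selected by a strict majority of models."""
    counts = Counter(sid for pred in predictions for sid in pred)
    return {sid for sid, c in counts.items() if c > len(predictions) / 2}
```

For three models voting {s1, s2}, {s1}, and {s1, s3}, only s1 clears the 2-of-3 threshold.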

If this is right

  • Proprietary LLMs maintain strong performance across different prompt formulations in clinical domains.
  • Domain-adapted open-source models become viable alternatives when paired with optimized prompts.
  • Ensembling techniques such as majority voting and LLM judging enhance the robustness of predictions.
  • Prompt-only methods allow effective solutions in data-scarce and privacy-sensitive healthcare QA scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar prompting techniques may apply to other domains with regulatory data restrictions, reducing dependence on proprietary services.
  • The success of domain-adapted bases suggests that pre-adaptation is a key enabler for open-source competitiveness.
  • Automatic metrics from the shared task may need supplementation with human expert review to confirm clinical safety.
  • Future work could test these prompts on diverse clinical datasets beyond the shared task.

Load-bearing premise

That the shared task test set and automatic metrics accurately reflect real-world clinical performance and that LLM outputs contain no medically harmful inaccuracies or hallucinations.

What would settle it

Conducting a blind evaluation by medical experts on the generated answers for factual accuracy, safety, and alignment with patient records would reveal if the competitive performance translates to reliable clinical use.

Figures

Figures reproduced from arXiv: 2605.03618 by Alexander Christiansen, Alexandros Maniatis, Richard A. A. Jonker, Rogério Braunschweiger de Freitas Lima, Roman Jurowetzki, Rúben Garrido, Sérgio Matos.

Figure 1. Subtask 1 Validation Results, showing the different prompts across several open- and closed-source models. Scores are the official evaluation metric: the average of ROUGE, BERTScore, AlignScore, and MEDCON. Accompanying leaderboard excerpt (Model / Team: rank, score): Sonnet-4.5 Prompt 10 (our submission): 13, 19.0; GPT-5.2 Prompt 8 (our submission): –, 16.6; best competitor: 1, 31.2; median: 7, 25.6.
Figure 2. Subtask 2 Validation Results over the Strict Micro F1 metric. The final system relied on an ensemble of the strongest proprietary models: the primary submission, a majority-voting ensemble of Gemini 3 Flash Preview, Grok 4.1 Fast, and Claude Sonnet 4.5 (all using prompt 10), achieved a Strict Micro F1 of 58.8, ranking 11th on the leaderboard.
Figure 3. Subtask 3 Validation Results.
Figure 4. Subtask 4 Validation Results. To maximize robustness, the team ran an ensemble search over the strongest model-prompt combinations (Prompts 5-7), with the final submissions using Prompts 5 and 6; the ensemble composition was selected solely on Micro F1 performance on the 20-case development set, with no subsequent adjustment before or after test-set submission.
Figure 5. Subtask 1 Validation Results (All Models).
Figure 6. Subtask 3 Validation Results (All Models).
Figure 7. Subtask 4 Validation Results (All Models).
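Strict Micro F1, the Subtask 2 metric above, is presumably a micro-averaged F1 pooled over all cases. The exact matching rules behind "strict" live in the shared-task evaluation scripts, so the sketch below is a standard reconstruction under that assumption, with `gold` and `pred` mapping case IDs to sets of cited sentence IDs.

```python
def strict_micro_f1(gold: dict[str, set[str]], pred: dict[str, set[str]]) -> float:
    """Micro-averaged F1 over cited evidence sentences, pooled across cases."""
    # Pool true positives, false positives, and false negatives over all
    # gold cases before computing precision and recall (micro averaging).
    tp = sum(len(gold[c] & pred.get(c, set())) for c in gold)
    fp = sum(len(pred.get(c, set()) - gold[c]) for c in gold)
    fn = sum(len(gold[c] - pred.get(c, set())) for c in gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```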
Original abstract

This paper presents the joint participation of the BIT.UA and AAUBS groups in the ArchEHR-QA 2026 shared task, which focuses on clinical question answering and evidence grounding in a low-resource setting. Due to the absence of training data and the strict data privacy constraints inherent to the healthcare domain (e.g. GDPR), we investigate the capabilities of Large Language Models (LLMs) without weight updates. We evaluate several state-of-the-art proprietary models and locally deployable open-source alternatives using various prompt engineering strategies, including task decomposition, Chain-of-Thought, and in-context learning. Furthermore, we explore majority voting and LLM-as-a-judge ensembling techniques to maximize predictive robustness. Our results demonstrate that while proprietary models exhibit strong resilience to prompt variations, domain-adapted open-source models (such as MedGemma 3 27B) achieve highly competitive performance when paired with the right prompt. Overall, our prompt-based approach proved highly effective, securing 1st place in Subtask 4 (evidence citation alignment) and 3rd place in Subtask 3 (patient-friendly answer generation). All code, results, and prompts are available on our GitHub repository: https://github.com/bioinformatics-ua/ArchEHR-QA-2026.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This paper reports the joint participation of the BIT.UA and AAUBS teams in the ArchEHR-QA 2026 shared task on clinical question answering and evidence grounding in a low-resource setting. Due to privacy constraints, the authors rely on prompting strategies with both proprietary and open-source LLMs, including task decomposition, Chain-of-Thought, in-context learning, majority voting, and LLM-as-a-judge ensembling. They highlight the resilience of proprietary models to prompt variations and the competitiveness of domain-adapted open-source models like MedGemma 3 27B, achieving 1st place in Subtask 4 (evidence citation alignment) and 3rd place in Subtask 3 (patient-friendly answer generation).

Significance. If the reported rankings and observations hold, this work demonstrates the viability of prompt engineering for clinical QA tasks where fine-tuning is prohibited by data privacy regulations. It provides evidence that carefully designed prompts can enable open-source models to compete with proprietary ones in specialized domains, which is valuable for reproducible and accessible AI in healthcare. The public release of code, results, and prompts further enhances its utility for the community.

major comments (2)
  1. Abstract: the assertion that the prompt-based approach 'proved highly effective' and that domain-adapted models 'achieve highly competitive performance' is not accompanied by any numerical scores, ablation studies, or error analysis within the manuscript, forcing readers to consult the external leaderboard or GitHub for verification.
  2. The manuscript provides no detailed description of the exact prompt templates, the specific configurations for majority voting or LLM-as-a-judge ensembling, or the full list of evaluated models beyond the single example of MedGemma 3 27B; these omissions undermine the ability to reproduce or extend the comparative observations on prompt resilience.
minor comments (2)
  1. Abstract: the title references evaluation of 'Open-Source and Proprietary LLMs' but the text does not enumerate all models tested or provide a summary table of their relative performances across subtasks.
  2. Consider including a dedicated results section or table that reports the official shared-task metrics for each subtask and model-prompt combination to improve clarity and allow direct comparison with other participants.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below and will incorporate changes to improve clarity and reproducibility.

Point-by-point responses
  1. Referee: Abstract: the assertion that the prompt-based approach 'proved highly effective' and that domain-adapted models 'achieve highly competitive performance' is not accompanied by any numerical scores, ablation studies, or error analysis within the manuscript, forcing readers to consult the external leaderboard or GitHub for verification.

    Authors: We agree that the abstract would be strengthened by direct numerical support. While the manuscript body reports our shared-task rankings and comparative observations, we will revise the abstract to include key performance scores for Subtasks 3 and 4. This will allow readers to assess effectiveness without external lookup. Ablation studies and error analyses appear in the results section; we will ensure they are clearly cross-referenced from the abstract in the revision. revision: yes

  2. Referee: The manuscript provides no detailed description of the exact prompt templates, the specific configurations for majority voting or LLM-as-a-judge ensembling, or the full list of evaluated models beyond the single example of MedGemma 3 27B; these omissions undermine the ability to reproduce or extend the comparative observations on prompt resilience.

    Authors: The manuscript already states that all prompts, code, results, and model lists are publicly released on GitHub to support full reproducibility. To further address self-containment, we will add an appendix in the revised version that includes representative prompt templates, exact configurations for majority voting and LLM-as-a-judge ensembling, and the complete list of evaluated models (both proprietary and open-source). revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a systems report documenting participation in the ArchEHR-QA 2026 shared task. It describes empirical evaluation of LLMs using prompting techniques (task decomposition, Chain-of-Thought, in-context learning, majority voting, LLM-as-a-judge) in a low-resource setting with no training data or weight updates. Central claims consist of observed performance rankings (1st in Subtask 4, 3rd in Subtask 3) and comparative statements about prompt resilience and model competitiveness, all directly verifiable against the official shared-task leaderboard and metrics. No equations, derivations, fitted parameters, or load-bearing self-citations appear; the work contains no predictive modeling or theoretical chain that could reduce to its own inputs by construction. The absence of any such structure makes circularity impossible.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, mathematical axioms, or invented entities are introduced; the work relies entirely on existing LLMs and established prompting practices evaluated on a shared task benchmark.

pith-pipeline@v0.9.0 · 5582 in / 1176 out tokens · 64440 ms · 2026-05-07T16:40:07.738414+00:00 · methodology

