pith. machine review for the scientific record.

arxiv: 2604.23801 · v1 · submitted 2026-04-26 · 💻 cs.CL · cs.IR

Recognition: unknown

Domain Fine-Tuning vs. Retrieval-Augmented Generation for Medical Multiple-Choice Question Answering: A Controlled Comparison at the 4B-Parameter Scale


Pith reviewed 2026-05-08 06:08 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords domain fine-tuning · retrieval-augmented generation · medical question answering · MedQA-USMLE · large language models · 4B parameter scale

The pith

Domain fine-tuning outperforms retrieval-augmented generation for medical multiple-choice questions at the 4B-parameter scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares two ways to add medical knowledge to a small 4B-parameter language model: fine-tuning it on domain data or adding retrieved medical passages to its input at inference time. It tests all combinations on the MedQA-USMLE benchmark using the same model size, prompt, and evaluation method. Fine-tuning improves majority-vote accuracy by 6.8 points over the base model, while retrieval adds no significant benefit and may even hurt the fine-tuned version slightly. This suggests that for this scale and task, knowledge baked into the model weights works better than knowledge supplied in the prompt context. The result helps decide resource allocation when deploying open small models in medicine.

Core claim

By holding model size, prompt, decoding, retrieval, and evaluation fixed and varying only domain adaptation and RAG presence, the experiment shows domain fine-tuning raises majority-vote accuracy from 46.4% to 53.3% on the 1,273-question MedQA-USMLE test set, a gain significant at p < 10^-4 by McNemar test, whereas RAG yields no significant improvement.
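
The paired test behind that p-value is straightforward to reproduce from per-question correctness alone. Below is a minimal sketch of an exact McNemar test, assuming boolean correctness vectors aligned by question; it is illustrative, not the authors' released code.

```python
# Exact McNemar test on paired per-question correctness for two setups.
# Hypothetical toy data; the real comparison uses the 1,273 MedQA questions.
from scipy.stats import binomtest

def mcnemar_exact(correct_a, correct_b):
    # Discordant pairs: questions that exactly one of the two setups gets right.
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if not a and bb)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under H0 the two discordant directions are equally likely (p = 0.5).
    p = binomtest(b, n, 0.5, alternative="two-sided").pvalue
    return b, c, p

base_correct      = [True, False, False, True, False]   # toy example
finetuned_correct = [True, True,  False, True, True]
print(mcnemar_exact(base_correct, finetuned_correct))    # (b, c, p-value)
```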

What carries the argument

The 2x2 controlled comparison of a general 4B model versus its domain-fine-tuned counterpart, each run with and without retrieved medical explanations from MedMCQA.
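
To make the design concrete, the sketch below drives the four cells through Ollama's local HTTP API, which the paper uses for serving. The model tags, prompt wording, temperature, and the retrieve_passages helper are assumptions for illustration, not the released experiment code.

```python
# Hedged sketch of the 2x2 grid: {general, domain-tuned} x {no RAG, RAG},
# served through Ollama's /api/generate endpoint.
import requests

MODELS = {"general": "gemma3:4b", "domain": "medgemma-4b"}  # assumed tags

def ask(model, question, context=None, temperature=0.7):  # temperature is a placeholder
    prompt = ""
    if context:  # RAG cell: retrieved passages prepended to the fixed template
        prompt += "Relevant background:\n" + "\n".join(context) + "\n\n"
    prompt += question + "\nAnswer with a single option letter (A-D)."
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": model, "prompt": prompt, "stream": False,
                            "options": {"temperature": temperature}})
    r.raise_for_status()
    return r.json()["response"].strip()

def run_cell(model_key, use_rag, questions, retrieve_passages, reps=3):
    """Run one of the four cells: every question asked `reps` times."""
    answers = {}
    for q in questions:  # assumed schema: {"id": ..., "stem": ...}
        ctx = retrieve_passages(q["stem"]) if use_rag else None
        answers[q["id"]] = [ask(MODELS[model_key], q["stem"], ctx)
                            for _ in range(reps)]
    return answers
```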

Load-bearing premise

That the chosen retrieval corpus and pipeline represent a fair implementation of RAG for this task.

What would settle it

Repeating the comparison with a stronger retrieval system such as dense vector search with reranking over a larger medical corpus and checking whether RAG then produces a statistically significant accuracy gain.
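
A minimal sketch of what such a stronger pipeline could look like, dense retrieval over a larger corpus followed by cross-encoder reranking, is given below; the embedder, reranker, and corpus are placeholders, not anything the paper tested.

```python
# Dense retrieval + cross-encoder reranking sketch (assumed off-the-shelf models).
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

encoder = SentenceTransformer("all-MiniLM-L6-v2")                 # stand-in embedder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # stand-in reranker

corpus = ["...medical passage 1...", "...medical passage 2..."]   # larger medical corpus
corpus_emb = encoder.encode(corpus, normalize_embeddings=True)

def retrieve(question, k_dense=50, k_final=5):
    q_emb = encoder.encode([question], normalize_embeddings=True)[0]
    dense_scores = corpus_emb @ q_emb                 # cosine similarity (normalized)
    candidates = np.argsort(-dense_scores)[:k_dense]  # first-stage dense candidates
    pairs = [(question, corpus[i]) for i in candidates]
    rerank_scores = reranker.predict(pairs)           # second-stage cross-encoder
    keep = np.argsort(-rerank_scores)[:k_final]
    return [corpus[candidates[i]] for i in keep]
```

The probe would then be to swap a retriever of this shape into the RAG cells and re-run the McNemar comparison.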

Figures

Figures reproduced from arXiv: 2604.23801 by Avi-ad Avraam Buskila.

Figure 1. Majority-vote accuracy with 95% confidence intervals.
Figure 2. Pairwise McNemar p-values across the four setups. The two non-significant cells are precisely the two RAG-toggle comparisons. (The paper uses a single retrieval corpus, MedMCQA explanations, and a single embedding model, nomic-embed-text; a different corpus such as UpToDate, clinical guidelines, or MedQA-aligned textbooks, or a stronger embedder, could change the RAG result.)
Figure 3. Within-setup consistency across repetitions.
Original abstract

Practitioners deploying small open-weight large language models (LLMs) for medical question answering face a recurring design choice: invest in a domain-fine-tuned model, or keep a general-purpose model and inject domain knowledge at inference time via retrieval-augmented generation (RAG). We isolate this trade-off by holding model size, prompt template, decoding temperature, retrieval pipeline, and evaluation protocol fixed, and varying only (i) whether the model has been domain-adapted (Gemma 3 4B vs. MedGemma 4B, both 4-bit quantized and served via Ollama) and (ii) whether retrieved passages from a medical knowledge corpus are inserted into the prompt. We evaluate all four cells of this 2x2 design on the full MedQA-USMLE 4-option test split (1,273 questions) with three repetitions per question (15,276 LLM calls). Domain fine-tuning yields a +6.8 percentage-point gain in majority-vote accuracy over the general 4B baseline (53.3% vs. 46.4%, McNemar p < 10^-4). RAG over MedMCQA explanations does not produce a statistically significant gain in either model, and in the domain-tuned model the point estimate is slightly negative (-1.9 pp, p = 0.16). At this scale and on this benchmark, domain knowledge encoded in weights dominates domain knowledge supplied in context. We release the full experiment code and JSONL traces to support replication.
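
The abstract leaves the vote-aggregation rule implicit. Below is a minimal sketch of majority voting over the three repetitions, with an explicit (assumed) tie-breaking rule and per-run accuracies reported alongside the majority-vote figure; it is not the authors' code.

```python
# Majority vote over repeated runs; the tie-break (earliest run among the most
# frequent answers) is an assumption, not a documented choice of the paper.
from collections import Counter

def majority_vote(answers):
    """answers: option letters from repeated runs, e.g. ['B', 'B', 'C']."""
    counts = Counter(answers)
    best = max(counts.values())
    winners = [a for a in answers if counts[a] == best]
    return winners[0]

def accuracies(per_question_answers, gold):
    """per_question_answers: question id -> list of answers; gold: id -> letter."""
    reps = len(next(iter(per_question_answers.values())))
    per_run = [sum(per_question_answers[q][r] == gold[q] for q in gold) / len(gold)
               for r in range(reps)]
    majority = sum(majority_vote(per_question_answers[q]) == gold[q]
                   for q in gold) / len(gold)
    return per_run, majority
```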

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports a controlled 2×2 comparison of domain fine-tuning versus retrieval-augmented generation (RAG) for medical multiple-choice question answering using 4B-parameter models. Holding model size, prompt, temperature, and evaluation fixed, it evaluates Gemma-3-4B and MedGemma-4B with and without RAG over MedMCQA explanations on the full MedQA-USMLE test set (1,273 questions, 3 repetitions each). The key finding is that domain fine-tuning improves majority-vote accuracy by 6.8 percentage points (53.3% vs. 46.4%, McNemar p < 10^{-4}), while RAG yields no significant gain and a slight negative point estimate in the fine-tuned model.

Significance. If the results hold, the work supplies a clean empirical comparison showing that, at the 4B scale on MedQA-USMLE, domain adaptation via fine-tuning outperforms the tested form of in-context knowledge injection. Credit is due for the fully crossed design, three repetitions per item, McNemar testing, and public release of code plus JSONL traces. These elements make the measurements directly replicable and strengthen the practical takeaway for small-model deployment in medicine.

major comments (2)
  1. [Methods (RAG corpus and pipeline)] The central claim that 'domain knowledge encoded in weights dominates domain knowledge supplied in context' depends on the RAG condition producing no benefit. The retrieval corpus is restricted to explanations from the MedMCQA dataset. Because MedMCQA is a separate exam-style collection, its explanations may have limited topical overlap, depth, or lexical match with MedQA-USMLE items. Without an ablation using a broader corpus (e.g., PubMed or textbooks) or a stronger retriever, the null/negative RAG result may reflect a weak context-injection baseline rather than a general property of weights versus context. This is load-bearing for the dominance interpretation and the practitioner recommendation.
  2. [Results (§4) and Discussion] The paper reports point estimates and McNemar p-values for the four conditions but does not provide per-question error analysis or breakdown by medical topic. Such an analysis would clarify whether the fine-tuning advantage is concentrated in areas where MedMCQA explanations are least relevant, directly testing the corpus-overlap concern raised above.
minor comments (2)
  1. [Methods] The abstract states 'three repetitions per question (15,276 LLM calls)' but the methods should explicitly state the aggregation rule for majority vote (e.g., tie-breaking procedure) and confirm that the same seed or temperature settings were used across all cells.
  2. [Results] Table or figure presenting the four accuracy numbers should include both per-run accuracies and the majority-vote accuracies to allow readers to assess variance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting the strengths of our fully crossed design, statistical testing, and reproducibility measures. We address each major comment below, indicating where revisions will be made.

Point-by-point responses
  1. Referee: Methods (RAG corpus and pipeline): The central claim that 'domain knowledge encoded in weights dominates domain knowledge supplied in context' depends on the RAG condition producing no benefit. The retrieval corpus is restricted to explanations from the MedMCQA dataset. Because MedMCQA is a separate exam-style collection, its explanations may have limited topical overlap, depth, or lexical match with MedQA-USMLE items. Without an ablation using a broader corpus (e.g., PubMed or textbooks) or a stronger retriever, the null/negative RAG result may reflect a weak context-injection baseline rather than a general property of weights versus context. This is load-bearing for the dominance interpretation and the practitioner recommendation.

    Authors: We agree that the RAG corpus (MedMCQA explanations) is a specific choice and that a broader corpus such as PubMed abstracts or medical textbooks might produce stronger context injection. Our experiment deliberately holds the retrieval pipeline, corpus, and prompt fixed to isolate the effect of domain fine-tuning versus this form of in-context augmentation. The reported result is therefore conditional on the tested RAG configuration: at the 4B scale, fine-tuning yields a statistically significant gain while the chosen RAG does not. We do not claim that no possible RAG setup could ever close the gap. In the revised manuscript we will add an explicit limitations paragraph in the Discussion clarifying the scope of the claim and recommending that future comparisons test stronger retrievers and corpora. This preserves the value of the controlled comparison while acknowledging the referee's valid point about generalizability. revision: partial

  2. Referee: Results (§4) and Discussion: The paper reports point estimates and McNemar p-values for the four conditions but does not provide per-question error analysis or breakdown by medical topic. Such an analysis would clarify whether the fine-tuning advantage is concentrated in areas where MedMCQA explanations are least relevant, directly testing the corpus-overlap concern raised above.

    Authors: We accept this recommendation. The released JSONL traces contain per-question predictions across all four conditions, making such an analysis feasible. In the revision we will add a new subsection in Results that (i) reports accuracy stratified by available MedQA-USMLE subject categories and (ii) examines whether the fine-tuning advantage is larger on questions whose retrieved MedMCQA explanations exhibit low lexical or embedding overlap with the question stem. This directly tests the corpus-overlap hypothesis and will be accompanied by a brief discussion of any patterns observed. Because the traces are already public, the additional analysis can be performed without new model calls. revision: yes
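
To make that promised analysis concrete, here is a hedged sketch of overlap-stratified gains computed from the public traces. The JSONL field names, the Jaccard overlap measure, and the bin thresholds are assumptions for illustration; the released files may use different keys.

```python
# Stratify the fine-tuning gain by lexical overlap between each question stem
# and its retrieved MedMCQA explanations (all field and file names are assumed).
import json
from statistics import mean

def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def load(path):
    with open(path) as f:
        return {rec["question_id"]: rec for rec in map(json.loads, f)}

base  = load("gemma_rag.jsonl")      # assumed trace files
tuned = load("medgemma_rag.jsonl")

bins = {"low": [], "mid": [], "high": []}
for qid, rec in base.items():
    overlap = max((jaccard(rec["stem"], p) for p in rec["retrieved_passages"]),
                  default=0.0)
    key = "low" if overlap < 0.05 else "mid" if overlap < 0.15 else "high"  # arbitrary cuts
    bins[key].append(int(tuned[qid]["correct"]) - int(rec["correct"]))      # per-question gain

for key, gains in bins.items():
    print(key, f"n={len(gains)}", round(mean(gains), 3) if gains else "n/a")
```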

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with direct measurements

full rationale

The paper conducts a controlled 2x2 experiment measuring accuracy on MedQA-USMLE under fixed conditions, varying only domain adaptation and RAG presence. All reported gains, p-values, and conclusions are computed directly from the 15,276 LLM calls and majority votes; no equations, derivations, parameter fits, or predictions are defined in terms of the outputs. No self-citations are load-bearing, and the design contains no ansatz, uniqueness theorem, or renaming of prior results. The referee's skeptical concern addresses experimental coverage rather than logical self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that MedQA-USMLE is a valid proxy for medical QA capability and that the retrieval pipeline is a reasonable test of RAG.

axioms (1)
  • Domain assumption: MedQA-USMLE is a valid and representative benchmark for medical multiple-choice question answering.
    The paper adopts the benchmark without additional validation or discussion of its limitations.

pith-pipeline@v0.9.0 · 5590 in / 1229 out tokens · 70159 ms · 2026-05-08T06:08:06.318113+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

13 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1]

    Chroma: The AI-native open-source embedding database

    Chroma. Chroma: the AI-native open-source embedding database. https://www.trychroma.com/, 2024.

  2. [2]

    Approximate statistical tests for comparing supervised classification learning algorithms

    Thomas G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms, 1998.

  3. [3]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team and Google DeepMind. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.

  4. [4]

    What disease does this patient have? A large-scale open domain question answering dataset from medical exams

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.

  5. [5]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

  6. [6]

    Capabilities of GPT-4 on Medical Challenge Problems

    Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023.

  7. [7]

    Nomic Embed: Training a Reproducible Long Context Text Embedder

    Zach Nussbaum, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. Nomic Embed: Training a reproducible long context text embedder, 2024.

  8. [8]

    Ollama: Get up and running with large language models locally

    Ollama Contributors. Ollama: Get up and running with large language models locally. https://ollama.com, 2024.

  9. [9]

    MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering

    Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Proceedings of the Conference on Health, Inference, and Learning, pages 248–260, 2022.

  10. [10]

    MedGemma Technical Report

    Andrew Sellergren et al. MedGemma technical report. arXiv preprint arXiv:2507.05201, 2025.

  11. [11]

    Large language models encode clinical knowledge

    Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.

  12. [12]

    Augmenting black-box LLMs with medical textbooks for biomedical question answering

    Yubo Wang, Xueguang Ma, and Wenhu Chen. Augmenting black-box LLMs with medical textbooks for biomedical question answering. arXiv preprint arXiv:2309.02233, 2024.

  13. [13]

    Benchmarking retrieval-augmented generation for medicine

    Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and Aidong Zhang. Benchmarking retrieval-augmented generation for medicine. arXiv preprint arXiv:2402.13178, 2024.