pith. machine review for the scientific record.

arxiv: 2605.00421 · v2 · submitted 2026-05-01 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

RadLite: Multi-Task LoRA Fine-Tuning of Small Language Models for CPU-Deployable Radiology AI

Authors on Pith · no claims yet

Pith reviewed 2026-05-09 19:18 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords small language models · LoRA fine-tuning · radiology AI · CPU deployment · multi-task learning · medical NLP · parameter-efficient tuning · clinical AI assistants

The pith

Small language models fine-tuned with LoRA achieve strong multi-task radiology performance and run on consumer CPUs without GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether language models with only 3 to 4 billion parameters can become useful radiology assistants after efficient fine-tuning. The authors compile 162,000 training samples from twelve public datasets that cover nine separate tasks including report classification, impression generation, natural language inference, and staging. Both tested models show large gains over their zero-shot versions, with one model stronger at generating structured text and the other stronger at extracting information. After quantization the models fit in under 2.5 GB and run at usable speeds on ordinary laptop CPUs. A reader would care because this removes the need for expensive GPUs or cloud services in clinical settings where hardware is limited.
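
To make the compilation step concrete, here is a minimal sketch of flattening heterogeneous task records into a single instruction-tuning corpus. The task names, prompt templates, and field names are illustrative assumptions, not the authors' released preprocessing code; only the general pattern (twelve datasets, nine tasks, one unified chat-style corpus) follows the paper.

    # Minimal sketch: merge task-specific records into one instruction-tuning
    # corpus. Templates and field names are illustrative assumptions.
    import json
    import random

    # Hypothetical per-task prompt templates; the paper's nine tasks include
    # RADS classification, impression generation, NLI, NER, and N/M staging.
    TEMPLATES = {
        "rads_classification": "Assign the appropriate RADS category to this report:\n{report}",
        "impression_generation": "Write the impression section for these findings:\n{report}",
        "nli": "Premise:\n{premise}\nHypothesis:\n{hypothesis}\nDoes the premise entail the hypothesis?",
    }

    def to_instruction_sample(task: str, record: dict) -> dict:
        """Flatten one task-specific record into a chat-style training sample."""
        prompt = TEMPLATES[task].format(**record["inputs"])
        return {
            "task": task,
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": record["target"]},
            ],
        }

    def compile_corpus(datasets: dict, seed: int = 0) -> list:
        """Merge every task's records into one shuffled multi-task corpus."""
        corpus = [to_instruction_sample(task, rec)
                  for task, records in datasets.items()
                  for rec in records]
        random.Random(seed).shuffle(corpus)
        return corpus

    if __name__ == "__main__":
        demo = {"nli": [{"inputs": {"premise": "No focal consolidation.",
                                    "hypothesis": "The lungs are clear."},
                         "target": "entailment"}]}
        print(json.dumps(compile_corpus(demo), indent=2))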

Core claim

LoRA fine-tuning of the Qwen2.5-3B-Instruct and Qwen3-4B models on 162K radiology samples produces accuracy gains of 53 percent on RADS classification, 60 percent on natural language inference, and 89 percent on N-staging relative to zero-shot baselines. The two models display complementary strengths that an oracle ensemble exploits for best results across all tasks. Fine-tuned models can be converted to GGUF format and run at 4-8 tokens per second on consumer CPUs, while few-shot prompting after fine-tuning actually lowers performance.
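
The oracle ensemble in this claim amounts to routing each task to whichever of the two fine-tuned models scores higher on it. A minimal sketch of that selection rule follows; the per-task scores are placeholders, not the paper's reported numbers, and because a true oracle needs task-level labels it is an upper bound rather than a deployable router.

    # Minimal sketch of a per-task oracle ensemble: for each task, take the
    # score of whichever model is stronger. The numbers are placeholders.
    qwen25_scores = {"rads": 0.80, "impression": 0.74, "nli": 0.71, "ner": 0.62}
    qwen3_scores  = {"rads": 0.77, "impression": 0.70, "nli": 0.75, "ner": 0.69}

    def oracle_ensemble(a: dict, b: dict) -> dict:
        """Best-of-two per task; an upper bound, not a learned router."""
        return {task: max(a[task], b[task]) for task in a}

    if __name__ == "__main__":
        for task, score in oracle_ensemble(qwen25_scores, qwen3_scores).items():
            print(f"{task}: best-of-two = {score:.2f}")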

What carries the argument

LoRA fine-tuning applied to 3-4B parameter language models to adapt them simultaneously to nine radiology tasks, followed by quantization for CPU inference.
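
A minimal sketch of this adaptation step using the Hugging Face transformers and peft libraries is below. The rank, scaling, target modules, and the omitted trainer setup are assumptions for illustration, not hyperparameters taken from the paper.

    # Minimal LoRA adaptation sketch (transformers + peft). Hyperparameters
    # here are illustrative assumptions, not the paper's reported settings.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "Qwen/Qwen2.5-3B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    lora_cfg = LoraConfig(
        r=16,                       # assumed rank
        lora_alpha=32,              # assumed scaling
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_cfg)
    model.print_trainable_parameters()  # only the low-rank adapters train

    # The supervised training loop over the 162K multi-task corpus is omitted;
    # after training, the adapter can be merged into the base weights and the
    # merged model exported for GGUF quantization and CPU inference.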

If this is right

  • The fine-tuned models can be deployed in clinics that lack GPU hardware or internet access.
  • Combining the two models via an ensemble yields the highest scores on every task.
  • Parameter-efficient adaptation outperforms in-context few-shot examples for these specialized medical tasks.
  • Quantized models occupy roughly 2 GB and deliver inference speeds sufficient for interactive use on laptops (a minimal CPU inference sketch follows this list).
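
A minimal CPU inference sketch with llama-cpp-python over a Q4 GGUF file, assuming a hypothetical local model path; the sampling settings and thread count are illustrative, and actual throughput will vary around the reported 4-8 tokens per second.

    # Minimal CPU inference sketch over a quantized GGUF model.
    # The file name is a hypothetical placeholder, not a released artifact.
    from llama_cpp import Llama

    llm = Llama(
        model_path="radlite-qwen2.5-3b-q4_k_m.gguf",  # hypothetical local file
        n_ctx=4096,
        n_threads=8,        # match the laptop's physical cores
    )

    prompt = "Assign the appropriate RADS category to this report:\n..."
    out = llm(prompt, max_tokens=128, temperature=0.0)
    print(out["choices"][0]["text"])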

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same compilation and fine-tuning approach could be applied to other medical report domains such as pathology or cardiology with only modest additional data collection.
  • Running inference entirely on local hardware reduces the risk of sending protected patient data to external servers.
  • Future experiments could test whether even smaller models under 2B parameters retain acceptable performance after similar training.

Load-bearing premise

The 162K samples drawn from twelve public datasets and the up-to-500 held-out test samples per task reflect the variety and difficulty of real clinical radiology reports and images.

What would settle it

Running the same models on a fresh set of 500 real hospital radiology reports and images that were never part of any public dataset and finding accuracy below 60 percent on at least three of the nine tasks would show the results do not generalize.
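
That test reduces to a short check over per-task accuracies on the fresh external set. The thresholds in the sketch below come straight from the paragraph above; the data loading and model calls are left abstract, and the example accuracy values are placeholders.

    # Sketch of the proposed falsification check: evaluate the nine tasks on
    # ~500 never-public hospital samples and flag failure if three or more
    # tasks fall below 60% accuracy.
    ACCURACY_FLOOR = 0.60
    MAX_FAILING_TASKS = 2  # three or more failing tasks would falsify the claim

    def generalization_holds(per_task_accuracy: dict) -> bool:
        failing = [t for t, acc in per_task_accuracy.items() if acc < ACCURACY_FLOOR]
        return len(failing) <= MAX_FAILING_TASKS

    if __name__ == "__main__":
        # Placeholder numbers for illustration only.
        external_eval = {"rads": 0.72, "nli": 0.58, "n_staging": 0.55, "ner": 0.66}
        print("generalizes" if generalization_holds(external_eval) else "does not generalize")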

Figures

Figures reproduced from arXiv: 2605.00421 by Kartik Bose, Pankaj Gupta.

Figure 1: Training data overview: 162K samples across 9 radiology tasks and 12 public datasets.
Figure 2: Zero-shot vs. fine-tuned performance across 9 radiology tasks for Qwen2.5-3B (blue) and …
Figure 3: Radar plot comparing fine-tuned Qwen2.5-3B and Qwen3-4B across 9 tasks. Each axis rep…
Figure 4: Per-RADS system accuracy heatmap comparing Qwen2.5-3B (left) and Qwen3-4B (right).
Figure 5: Impact of few-shot prompting on fine-tuned Qwen3-4B RADS accuracy. Bars show per…
Figure 6: Clinical severity analysis of RADS classification errors. Left: error direction distribution…
Figure 7: Deployment tradeoff: model size (GGUF Q4…
read the original abstract

Large language models (LLMs) show promise in radiology but their deployment is limited by computational requirements that preclude use in resource-constrained clinical environments. We investigate whether small language models (SLMs) of 3-4 billion parameters can achieve strong multi-task radiology performance through LoRA fine-tuning, enabling deployment on consumer-grade CPUs. We train Qwen2.5-3B-Instruct and Qwen3-4B on 162K samples spanning 9 radiology tasks - RADS classification across 10 systems, impression generation, temporal comparison, radiology NLI, NER, abnormality detection, N/M staging, and radiology Q&A - compiled from 12 public datasets. Both models are evaluated on up to 500 held-out test samples per task with standardized metrics. Our key findings are: (1) LoRA fine-tuning dramatically improves performance over zero-shot baselines (RADS accuracy +53%, NLI +60%, N-staging +89%); (2) the two models exhibit complementary strengths - Qwen2.5 excels at structured generation tasks while Qwen3 dominates extractive tasks; (3) a task-outed oracle ensemble combining both models achieves the best performance across all tasks; (4) few-shot prompting with fine-tuned models hurts performance, demonstrating that LoRA adaptation is more effective than in-context learning for specialized domains; and (5) models can be quantized to GGUF format (~1.8-2.4GB) for CPU deployment at 4-8 tokens/second on consumer hardware. Our work demonstrates that small, efficiently fine-tuned models - which we collectively call RadLite - can serve as practical multi-task radiology AI assistants deployable entirely on consumer hardware without GPU requirements. Code and models are available at https://github.com/RadioX-Labs/RadLite

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RadLite, a collection of LoRA fine-tuned 3-4B parameter models (Qwen2.5-3B-Instruct and Qwen3-4B) trained on 162K samples compiled from 12 public radiology datasets spanning 9 tasks (RADS classification, impression generation, temporal comparison, NLI, NER, abnormality detection, N/M staging, and Q&A). It reports large gains over zero-shot baselines (e.g., +53% RADS accuracy, +60% NLI, +89% N-staging), complementary strengths between the two models, superiority of LoRA adaptation over few-shot prompting, an oracle ensemble that combines them, and successful GGUF quantization enabling 4-8 tokens/second inference on consumer CPUs without GPUs.

Significance. If the quantitative results and deployment claims hold after detailed scrutiny, the work would be significant for demonstrating that small, efficiently adapted models can handle diverse radiology tasks at practical speeds on consumer hardware. This could lower barriers to AI assistance in resource-constrained clinical settings. The open release of code and models, the multi-task compilation, and the observation of task complementarity are positive contributions that support reproducibility and further research in domain-specific SLM adaptation.

major comments (2)
  1. Abstract and §4 (Results): The large reported gains (+53% RADS accuracy, +60% NLI, +89% N-staging) are stated without specifying the exact evaluation metrics (accuracy vs. F1 vs. other), the precise zero-shot baseline configurations, per-task test sample counts and splits, or any statistical significance testing or confidence intervals. These details are load-bearing for interpreting whether the improvements are robust or potentially inflated by evaluation choices.
  2. §4 (Evaluation) and §5 (Discussion): All performance numbers are measured on held-out samples drawn from the same 12 public datasets used for training. No external clinical validation set, multi-institutional test data, or out-of-distribution evaluation is described. This directly weakens the central claim that the models constitute 'practical' multi-task radiology AI assistants for real-world clinical deployment, where reporting styles, dictation errors, and institutional variability differ from curated public corpora.
minor comments (2)
  1. Abstract: The phrase 'task-outed oracle ensemble' is nonstandard and undefined; it should be clarified (likely meaning an oracle that picks the stronger model per task) with a brief description of how the oracle is constructed.
  2. §3 (Methods): The distribution of the 162K training samples across the 9 tasks and 12 datasets is not summarized (e.g., via a table of sample counts per task). This makes it difficult to assess task balance and potential dominance by larger datasets.

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback, which has helped us improve the clarity and transparency of our work. We address each major comment below and have revised the manuscript accordingly.

read point-by-point responses
  1. Referee: Abstract and §4 (Results): The large reported gains (+53% RADS accuracy, +60% NLI, +89% N-staging) are stated without specifying the exact evaluation metrics (accuracy vs. F1 vs. other), the precise zero-shot baseline configurations, per-task test sample counts and splits, or any statistical significance testing or confidence intervals. These details are load-bearing for interpreting whether the improvements are robust or potentially inflated by evaluation choices.

    Authors: We agree that additional methodological detail is necessary for proper interpretation. In the revised manuscript, we have expanded the abstract and §4 to specify the exact metrics for each task (accuracy for RADS classification, N/M staging, and NLI; F1-score for NER and abnormality detection; ROUGE-L and BLEU for impression generation and temporal comparison). The zero-shot baselines are now explicitly defined as the unmodified base Qwen2.5-3B-Instruct and Qwen3-4B models with direct prompting and no in-context examples. We report the precise held-out test sizes (200–500 samples per task) and the dataset-specific train/test splits. We have also added 95% bootstrap confidence intervals for all reported metrics and noted statistically significant improvements (p < 0.05 via paired t-tests). These changes directly address the concern about potential inflation due to evaluation choices (a minimal sketch of the bootstrap interval computation follows these responses). revision: yes

  2. Referee: §4 (Evaluation) and §5 (Discussion): All performance numbers are measured on held-out samples drawn from the same 12 public datasets used for training. No external clinical validation set, multi-institutional test data, or out-of-distribution evaluation is described. This directly weakens the central claim that the models constitute 'practical' multi-task radiology AI assistants for real-world clinical deployment, where reporting styles, dictation errors, and institutional variability differ from curated public corpora.

    Authors: The referee is correct that our evaluations are confined to held-out splits from the same public corpora. We have revised §5 to include an expanded limitations paragraph that explicitly discusses this constraint, the risks of domain shift from institutional reporting differences and dictation noise, and the need for prospective clinical validation. We have also softened language around immediate 'practical' deployment to emphasize the work as a proof-of-concept for CPU-deployable multi-task radiology SLMs. However, because the study relies exclusively on publicly released datasets, we do not have access to external multi-institutional or prospective clinical data and therefore cannot supply such validation results. revision: partial
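
For readers who want the interval computation in response 1 spelled out, here is a minimal sketch of a 95% bootstrap confidence interval over per-sample correctness; the resample count and the toy data are assumptions, not values from the paper.

    # Minimal 95% bootstrap confidence interval for a per-task accuracy.
    # The resample count and the toy correctness vector are illustrative.
    import numpy as np

    def bootstrap_ci(correct: np.ndarray, n_boot: int = 10_000,
                     alpha: float = 0.05, seed: int = 0):
        """Percentile bootstrap CI for the mean of a 0/1 correctness vector."""
        rng = np.random.default_rng(seed)
        n = len(correct)
        stats = np.array([correct[rng.integers(0, n, n)].mean() for _ in range(n_boot)])
        return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

    if __name__ == "__main__":
        # 500 held-out predictions, 1 = correct, 0 = incorrect (toy data).
        correct = np.random.default_rng(1).binomial(1, 0.78, size=500).astype(float)
        lo, hi = bootstrap_ci(correct)
        print(f"accuracy = {correct.mean():.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")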

standing simulated objections not resolved
  • We do not possess external clinical or multi-institutional validation data and cannot perform the requested out-of-distribution evaluation.

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation on held-out data

full rationale

The paper compiles 162K samples from 12 public datasets, applies LoRA fine-tuning to 3-4B models, and reports performance metrics on up to 500 held-out test samples per task against zero-shot baselines. No equations, parameter fits presented as predictions, self-citations, uniqueness theorems, or ansatzes appear in the provided text. All claims (accuracy gains, model complementarity, CPU deployment speeds) are direct experimental measurements on the chosen splits, not reductions by construction. This is standard supervised ML evaluation, self-contained rather than validated against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical performance of LoRA adaptation on compiled public datasets; no free parameters or invented entities are explicitly introduced beyond standard ML practices.

axioms (1)
  • domain assumption: LoRA fine-tuning can substantially adapt small language models to specialized medical domains without catastrophic forgetting
    Implicit in the approach and the reported gains over zero-shot baselines.

pith-pipeline@v0.9.0 · 5642 in / 1373 out tokens · 59565 ms · 2026-05-09T19:18:44.334786+00:00 · methodology

