Does Language Shift Break Medical Vision-Language Models? Indonesian Radiology Visual Question Answering Case Study
Pith reviewed 2026-06-28 10:39 UTC · model grok-4.3
The pith
Medical vision-language models show 8 to 25 percent worse performance on Indonesian radiology questions than English ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Strong performance on English medical VQA benchmarks does not necessarily translate to robust behavior in Indonesian clinical contexts. The evaluations show a performance gap of 8 to 25 percent between the English and Indonesian settings depending on the evaluation metric, with error modes including yes/no flips, laterality errors, and output-language mismatches.
What carries the argument
IndoRad-VQA dataset, created by translating VQA-RAD with self-evaluation-based quality control to preserve clinical meaning, terminology consistency, and answer equivalence.
If this is right
- Strong English results do not guarantee reliable answers under Indonesian prompting for the same visual inputs.
- Failure modes such as yes/no flips and laterality errors increase under language shift.
- Inclusive multilingual evaluation is required for medical multimodal foundation models.
- The released IndoRad-VQA dataset supports further testing of non-English medical VQA.
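The yes/no flips and laterality errors named above can be counted automatically once model outputs are collected. The sketch below is a minimal illustration, not the paper's error-analysis code: the vocabularies, the function names `first_label` and `classify_error`, and the example answers are all assumptions. Note that English "right" can also mean "correct", so a real analysis would need more careful matching.

```python
import re

# Small bilingual vocabularies (assumed for illustration; extend for real use).
YES_NO = {"yes": "yes", "no": "no", "ya": "yes", "tidak": "no"}
SIDES = {"left": "left", "right": "right", "kiri": "left", "kanan": "right"}


def first_label(text, vocab):
    """Return the canonical label of the first vocabulary word in text, or None."""
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in vocab:
            return vocab[token]
    return None


def classify_error(prediction, reference):
    """Label a prediction as a yes/no flip, a laterality error, or neither."""
    for name, vocab in (("yes_no_flip", YES_NO), ("laterality_error", SIDES)):
        ref_label = first_label(reference, vocab)
        pred_label = first_label(prediction, vocab)
        # An error is flagged only when both sides commit to opposite labels.
        if ref_label and pred_label and ref_label != pred_label:
            return name
    return "none"


# Made-up (prediction, reference) pairs under Indonesian prompting.
examples = [("Tidak", "ya"), ("paru kanan", "kiri"), ("Ya, ada efusi", "ya")]
labels = [classify_error(p, r) for p, r in examples]
print(labels)
```

Running the same classifier on the English and Indonesian outputs for the same images would show whether each error type becomes more frequent under language shift.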
Where Pith is reading between the lines
- Similar performance drops may appear in other languages with comparable translation challenges and lower representation in training data.
- Targeted fine-tuning on Indonesian medical image-text pairs could reduce the gap, though this would need separate validation.
- The gap points to a broader issue where visual reasoning in VLMs remains coupled to the language of the question rather than purely to the image content.
Load-bearing premise
The self-evaluation-based quality control during translation preserves clinical meaning, terminology consistency, and answer equivalence.
What would settle it
A tested model that performs as well on the Indonesian IndoRad-VQA questions as on the original English VQA-RAD questions would falsify the reported language-induced performance gap for that model.
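Because both settings use the same images and questions, "equal performance" can be checked with a paired test on per-question correctness. One standard option is an exact McNemar test, sketched below; the source does not say the authors used it, and the function name `mcnemar_exact` and the toy correctness vectors are assumptions for illustration.

```python
from math import comb


def mcnemar_exact(correct_en, correct_id):
    """Exact two-sided McNemar test on paired per-question correctness.

    Returns (en_only, id_only, p_value), where en_only counts questions
    answered correctly only in English and id_only only in Indonesian.
    """
    if len(correct_en) != len(correct_id):
        raise ValueError("paired inputs must have equal length")
    en_only = sum(e and not i for e, i in zip(correct_en, correct_id))
    id_only = sum(i and not e for e, i in zip(correct_en, correct_id))
    n = en_only + id_only
    if n == 0:
        return en_only, id_only, 1.0  # no discordant pairs, no evidence of a gap
    k = min(en_only, id_only)
    # Under the null, discordant pairs split 50/50 (binomial with p = 0.5).
    tail = sum(comb(n, j) for j in range(k + 1)) / 2 ** n
    return en_only, id_only, min(1.0, 2 * tail)


# Made-up correctness flags for ten paired questions.
correct_en = [True] * 8 + [False] * 2
correct_id = [True, True] + [False] * 6 + [True, False]
en_only, id_only, p_value = mcnemar_exact(correct_en, correct_id)
print(en_only, id_only, p_value)
```

A large p-value on a sufficiently large evaluation set would be consistent with no gap for that model; a small one would support the reported degradation.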
Original abstract
Medical Vision-Language Models (VLMs) are typically evaluated on English radiology visual question answering benchmarks, leaving their robustness under non-English clinical language largely unexplored. We introduce IndoRad-VQA, an Indonesian adaptation of VQA-RAD, to assess whether medical VLMs retain radiology reasoning ability when questions are asked in Bahasa Indonesia. Radiology question-answer pairs are translated into Indonesian with self-evaluation-based quality control to preserve clinical meaning, terminology consistency, and answer equivalence. We evaluate general-purpose, Southeast Asian multilingual, and medical-specific VLMs under English and Indonesian prompting settings. Beyond accuracy, we quantify the language robustness gap between English and Indonesian inputs. We also conduct an error analysis to identify failure modes of question answering, such as yes/no flips, laterality errors, and output-language mismatches. Our findings show that strong performance on English medical VQA benchmarks does not necessarily translate to robust behavior in Indonesian clinical contexts. We observe a performance gap of 8 to 25 percent between the English and Indonesian settings, depending on the evaluation metric. These results highlight the need for more inclusive multilingual evaluation of medical multimodal foundation models. The dataset is available at https://huggingface.co/datasets/Lab-IS/IndoRad-VQA.
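The language robustness gap can be illustrated with a short sketch. The abstract does not say whether the 8 to 25 percent figure is an absolute difference in percentage points or a relative drop, so the hypothetical helper below reports both. The exact-match scoring, the function names, and the toy answers are assumptions for illustration, not the paper's evaluation code.

```python
def accuracy(predictions, references):
    """Exact-match accuracy after lowercasing and stripping whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("empty evaluation set")
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)


def robustness_gap(acc_en, acc_id):
    """Return (absolute gap in percentage points, relative gap in percent)."""
    absolute = (acc_en - acc_id) * 100.0
    relative = (acc_en - acc_id) / acc_en * 100.0 if acc_en > 0 else float("nan")
    return absolute, relative


# Made-up answers for the same four images under each prompting language.
refs_en, preds_en = ["yes", "no", "left", "yes"], ["yes", "no", "left", "no"]
refs_id, preds_id = ["ya", "tidak", "kiri", "ya"], ["ya", "ya", "kanan", "tidak"]

acc_en = accuracy(preds_en, refs_en)
acc_id = accuracy(preds_id, refs_id)
abs_gap, rel_gap = robustness_gap(acc_en, acc_id)
print(f"EN={acc_en:.2f} ID={acc_id:.2f} gap={abs_gap:.1f} pp ({rel_gap:.1f}% relative)")
```

Reporting both numbers avoids ambiguity when gaps are compared across models with very different English baselines.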
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces IndoRad-VQA, an Indonesian adaptation of the VQA-RAD dataset, created by translating radiology question-answer pairs using self-evaluation-based quality control to maintain clinical meaning and answer equivalence. It evaluates general-purpose, multilingual, and medical-specific vision-language models on both English and Indonesian versions of the dataset, reporting performance gaps of 8 to 25 percent depending on the metric, along with an error analysis identifying issues such as yes/no flips, laterality errors, and output-language mismatches. The central claim is that strong performance on English medical VQA benchmarks does not necessarily indicate robustness in Indonesian clinical contexts.
Significance. If the IndoRad-VQA dataset faithfully preserves clinical content, this work provides valuable empirical evidence that current medical VLMs may not generalize well across languages, highlighting the need for multilingual evaluation in medical AI. The public release of the dataset on Hugging Face is a positive contribution that enables reproducibility and further studies.
major comments (1)
- [Abstract] The self-evaluation-based quality control for translation is described without quantitative metrics such as translation error rates, inter-annotator agreement, or validation by Indonesian radiologists. This omission is load-bearing because the central claim attributes the 8-25% performance gap to language shift; without external clinical validation, the gap could stem from translation artifacts in terminology or answer equivalence.
minor comments (1)
- [Abstract] The abstract mentions 'depending on the evaluation metric' but does not specify which metrics are used or provide the exact values for each.
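The inter-annotator agreement requested in the major comment could be quantified with Cohen's kappa if two reviewers independently judged each translated pair. The sketch below is an illustration under that assumption: the labels "ok" and "bad", the function name `cohens_kappa`, and the four example judgments are hypothetical, and no such annotation is reported in the source.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments on whether four translations preserve the answer.
annotator_a = ["ok", "ok", "bad", "bad"]
annotator_b = ["ok", "ok", "bad", "ok"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

Kappa corrects raw agreement for chance, which matters here because most translations are likely to be judged acceptable by both reviewers.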
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the single major comment below and will incorporate revisions to strengthen the description of our translation quality control process.
- Referee: [Abstract] The self-evaluation-based quality control for translation is described without quantitative metrics such as translation error rates, inter-annotator agreement, or validation by Indonesian radiologists. This omission is load-bearing because the central claim attributes the 8-25% performance gap to language shift; without external clinical validation, the gap could stem from translation artifacts in terminology or answer equivalence.
Authors: We agree that the abstract provides only a high-level description of the self-evaluation-based quality control without quantitative metrics. The full manuscript (Section 3.2) details that the process combined machine translation with LLM self-evaluation prompts to verify clinical meaning, terminology consistency, and answer equivalence, followed by author spot-checks. No inter-annotator agreement or radiologist validation was performed, as the workflow was primarily automated. In the revised manuscript we will expand the abstract to note the self-evaluation approach and add an explicit limitations paragraph acknowledging the lack of external clinical validation. We maintain that the 8-25% gaps are unlikely to be explained solely by translation artifacts, because the error analysis identifies consistent, language-specific failure modes (yes/no flips, laterality errors, output-language mismatches) across multiple models that align with known cross-lingual challenges rather than random translation noise.
Revision: yes
Circularity Check
No circularity: purely empirical dataset creation and model evaluation
The paper introduces IndoRad-VQA by translating VQA-RAD question-answer pairs with self-evaluation QC and directly measures performance gaps on off-the-shelf VLMs under English vs. Indonesian prompting. No equations, derivations, fitted parameters, or predictions that reduce to inputs by construction are present. The central claim rests on observed accuracy differences (8-25%) rather than any self-referential step. Self-citations, if any, are not load-bearing for the empirical comparison. The study is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: self-evaluation-based quality control during translation preserves clinical meaning, terminology consistency, and answer equivalence.