Does Language Shift Break Medical Vision-Language Models? Indonesian Radiology Visual Question Answering Case Study

Dzaki Rafif Malik; Novanto Yudistira; Pieter Christy Yan Yudhistira

arxiv: 2606.03693 · v1 · pith:QSVG74C4new · submitted 2026-06-02 · 💻 cs.CL · cs.CV

Pith reviewed 2026-06-28 10:39 UTC · model grok-4.3

keywords: medical vision-language models, radiology VQA, Indonesian, language robustness, VQA-RAD, multilingual evaluation, clinical AI, IndoRad-VQA

The pith

Medical vision-language models show 8 to 25 percent worse performance on Indonesian radiology questions than English ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether medical vision-language models retain radiology reasoning ability when questions shift from English to Indonesian. It creates IndoRad-VQA by translating VQA-RAD pairs into Bahasa Indonesia using self-evaluation quality controls to keep clinical meaning and answer equivalence intact. Evaluations of general-purpose, multilingual, and medical-specific models reveal consistent drops in accuracy and other metrics. This matters because real clinical use often occurs in local languages, so English benchmark success alone cannot confirm reliable deployment.

Core claim

Strong performance on English medical VQA benchmarks does not necessarily translate to robust behavior in Indonesian clinical contexts. The evaluations show a performance gap of 8 to 25 percent between the English and Indonesian settings depending on the evaluation metric, with error modes including yes/no flips, laterality errors, and output-language mismatches.
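The gap in the core claim can be made concrete with a small calculation. The sketch below is a minimal illustration, not the paper's evaluation code: the summary does not say whether "8 to 25 percent" means absolute points or a relative drop, so both are computed. The `PairedResult` type and the toy data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PairedResult:
    """Correctness of one model on one question, in both languages."""
    question_id: str
    correct_en: bool
    correct_id: bool


def accuracy(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def language_gap(results):
    """Return (acc_en, acc_id, absolute gap in points, relative drop in %)."""
    acc_en = accuracy(r.correct_en for r in results)
    acc_id = accuracy(r.correct_id for r in results)
    abs_gap = (acc_en - acc_id) * 100.0
    rel_drop = (acc_en - acc_id) / acc_en * 100.0 if acc_en > 0 else 0.0
    return acc_en, acc_id, abs_gap, rel_drop


# Toy data, invented for illustration only (not results from the paper).
toy = [
    PairedResult("q1", True, True),
    PairedResult("q2", True, True),
    PairedResult("q3", True, False),
    PairedResult("q4", True, False),
    PairedResult("q5", False, False),
]
acc_en, acc_id, abs_gap, rel_drop = language_gap(toy)
print(f"EN={acc_en:.2f} ID={acc_id:.2f} gap={abs_gap:.1f} pts rel={rel_drop:.1f}%")
```

Reporting both forms avoids ambiguity: the same pair of accuracies can look like a modest gap in points and a large one in relative terms.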

What carries the argument

IndoRad-VQA dataset, created by translating VQA-RAD with self-evaluation-based quality control to preserve clinical meaning, terminology consistency, and answer equivalence.

If this is right

Strong English results do not guarantee reliable answers under Indonesian prompting for the same visual inputs.
Failure modes such as yes/no flips and laterality errors increase under language shift.
Inclusive multilingual evaluation is required for medical multimodal foundation models.
The released IndoRad-VQA dataset supports further testing of non-English medical VQA.
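The failure modes named in the list above could be flagged with simple string heuristics. The following is a rough sketch under stated assumptions: the word lists are tiny and hand-picked for illustration, the language guess is a crude cue-word count, and nothing here reproduces the paper's actual error analysis. All function names are hypothetical.

```python
import re

# Assumed surface forms; a real study would need a curated clinical lexicon.
YES = {"yes", "ya"}
NO = {"no", "tidak"}
LEFT = {"left", "kiri"}
RIGHT = {"right", "kanan"}
# Small cue-word lists used as a crude language signal (assumption).
ID_CUES = {"ya", "tidak", "di", "dan", "yang", "pada", "kiri", "kanan", "ada"}
EN_CUES = {"yes", "no", "the", "and", "in", "is", "left", "right", "there"}


def tokens(text):
    """Lowercased alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def polarity(text):
    """'yes', 'no', or None when absent or ambiguous."""
    toks = set(tokens(text))
    is_yes, is_no = bool(toks & YES), bool(toks & NO)
    if is_yes != is_no:
        return "yes" if is_yes else "no"
    return None


def side(text):
    """'left', 'right', or None when absent or ambiguous."""
    toks = set(tokens(text))
    has_left, has_right = bool(toks & LEFT), bool(toks & RIGHT)
    if has_left != has_right:
        return "left" if has_left else "right"
    return None


def guess_language(text):
    """'id', 'en', or None when the cue counts tie."""
    toks = tokens(text)
    id_hits = sum(t in ID_CUES for t in toks)
    en_hits = sum(t in EN_CUES for t in toks)
    if id_hits == en_hits:
        return None
    return "id" if id_hits > en_hits else "en"


def tag_errors(gold, prediction, expected_lang="id"):
    """Return the set of failure-mode tags that apply to one prediction."""
    tags = set()
    g_pol, p_pol = polarity(gold), polarity(prediction)
    if g_pol and p_pol and g_pol != p_pol:
        tags.add("yes_no_flip")
    g_side, p_side = side(gold), side(prediction)
    if g_side and p_side and g_side != p_side:
        tags.add("laterality_error")
    lang = guess_language(prediction)
    if lang and lang != expected_lang:
        tags.add("output_language_mismatch")
    return tags


print(tag_errors("ya", "tidak"))
print(tag_errors("paru kiri", "paru kanan"))
print(tag_errors("ya", "yes, there is"))
```

Note that the third example is tagged only as a language mismatch: the answer is semantically correct but in the wrong language, which is why this mode is worth counting separately from accuracy.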

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar performance drops may appear in other languages with comparable translation challenges and lower representation in training data.
Targeted fine-tuning on Indonesian medical image-text pairs could reduce the gap, though this would need separate validation.
The gap points to a broader issue where visual reasoning in VLMs remains coupled to the language of the question rather than purely to the image content.

Load-bearing premise

The self-evaluation-based quality control during translation preserves clinical meaning, terminology consistency, and answer equivalence.

What would settle it

The reported language-induced gap would be falsified if the tested models scored the same on the Indonesian IndoRad-VQA questions as on the original English VQA-RAD questions.
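Because every question is answered in both languages for the same image, "equal performance" can be checked with a paired test rather than by comparing two accuracies by eye. The sketch below uses an exact McNemar test, which is one conventional choice for paired binary outcomes; the paper summary does not say which test, if any, the authors used, and the toy data are invented.

```python
from math import comb


def mcnemar_exact(correct_en, correct_id):
    """Exact two-sided McNemar test on paired correctness flags.

    b = questions right in English but wrong in Indonesian
    c = questions wrong in English but right in Indonesian
    Under the null of no language effect, b follows Binomial(b + c, 0.5).
    """
    if len(correct_en) != len(correct_id):
        raise ValueError("paired inputs must have the same length")
    b = sum(e and not i for e, i in zip(correct_en, correct_id))
    c = sum(i and not e for e, i in zip(correct_en, correct_id))
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a gap
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


# Toy paired outcomes, invented for illustration only.
en = [True] * 8 + [False] * 2
ind = [False] * 6 + [True] * 2 + [True, False]
b, c, p_value = mcnemar_exact(en, ind)
print(f"b={b} c={c} p={p_value:.4f}")
```

Only the discordant pairs carry information here: questions that are right or wrong in both languages say nothing about a language effect.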

Figures

Figures reproduced from arXiv: 2606.03693 by Dzaki Rafif Malik, Novanto Yudistira, Pieter Christy Yan Yudhistira.

**Figure 2.** Qualitative failure examples of yes/no flip, language ...

Original abstract

Medical Vision-Language Models (VLMs) are typically evaluated on English radiology visual question answering benchmarks, leaving their robustness under non-English clinical language largely unexplored. We introduce IndoRad-VQA, an Indonesian adaptation of VQA-RAD, to assess whether medical VLMs retain radiology reasoning ability when questions are asked in Bahasa Indonesia. Radiology question-answer pairs are translated into Indonesian with self-evaluation-based quality control to preserve clinical meaning, terminology consistency, and answer equivalence. We evaluate general-purpose, Southeast Asian multilingual, and medical-specific VLMs under English and Indonesian prompting settings. Beyond accuracy, we quantify the language robustness gap between English and Indonesian inputs. We also conduct an error analysis to identify failure modes of question answering, such as yes/no flips, laterality errors, and output-language mismatches. Our findings show that strong performance on English medical VQA benchmarks does not necessarily translate to robust behavior in Indonesian clinical contexts. We observe a performance gap of 8 to 25 percent between the English and Indonesian settings, depending on the evaluation metric. These results highlight the need for more inclusive multilingual evaluation of medical multimodal foundation models. The dataset is available at https://huggingface.co/datasets/Lab-IS/IndoRad-VQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

New IndoRad-VQA dataset shows 8-25% English-to-Indonesian drop on medical VQA, but translation QC is the load-bearing unverified step.

The main takeaway is that this paper gives the first numbers on how much medical VLMs drop when radiology questions move from English to Indonesian, using a new dataset called IndoRad-VQA adapted from VQA-RAD. The reported gap of 8-25% depending on metric is the concrete result worth noting.

They translate the original QA pairs with self-evaluation QC meant to keep clinical meaning and answer equivalence, then run general-purpose, Southeast Asian multilingual, and medical-specific VLMs in both languages. They also include an error analysis covering yes/no flips, laterality mistakes, and output language mismatches. The dataset is released publicly.

What works is the direct empirical comparison and the release of the resource. It fills a clear gap by moving beyond English-only benchmarks and gives a usable starting point for anyone testing robustness in Indonesian clinical text. The error breakdown adds some practical detail on failure modes.

The soft spot is the translation validation. The abstract describes self-evaluation QC but gives no numbers on error rates, inter-annotator agreement, or whether Indonesian radiologists reviewed the output. If the process missed domain terms like laterality or pathology names, the measured gap could partly reflect translation artifacts rather than pure language shift. That assumption carries a lot of weight here.

This is for people working on medical multimodal models who need non-English test sets or are concerned about deployment outside English data. A reader focused on robustness or dataset construction would find the new benchmark and gap numbers useful.

It deserves peer review because the empirical result is new and the dataset is released, even though the methods section will need more detail on how the translations were checked.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces IndoRad-VQA, an Indonesian adaptation of the VQA-RAD dataset, created by translating radiology question-answer pairs using self-evaluation-based quality control to maintain clinical meaning and answer equivalence. It evaluates general-purpose, multilingual, and medical-specific vision-language models on both English and Indonesian versions of the dataset, reporting performance gaps of 8 to 25 percent depending on the metric, along with an error analysis identifying issues such as yes/no flips, laterality errors, and output-language mismatches. The central claim is that strong performance on English medical VQA benchmarks does not necessarily indicate robustness in Indonesian clinical contexts.

Significance. If the IndoRad-VQA dataset faithfully preserves clinical content, this work provides valuable empirical evidence that current medical VLMs may not generalize well across languages, highlighting the need for multilingual evaluation in medical AI. The public release of the dataset on Hugging Face is a positive contribution that enables reproducibility and further studies.

major comments (1)

[Abstract] The self-evaluation-based quality control for translation is described without quantitative metrics such as translation error rates, inter-annotator agreement, or validation by Indonesian radiologists. This omission is load-bearing because the central claim attributes the 8-25% performance gap to language shift; without external clinical validation, the gap could stem from translation artifacts in terminology or answer equivalence.
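The inter-annotator agreement the referee asks for is commonly summarized with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below shows the calculation for two hypothetical reviewers labeling translations as acceptable or not; no such annotation is reported in the source, and the labels are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments on six translated QA pairs.
reviewer_1 = ["ok", "ok", "ok", "bad", "ok", "bad"]
reviewer_2 = ["ok", "ok", "bad", "bad", "ok", "ok"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa={kappa:.2f}")
```

In this toy case the reviewers agree on four of six items, yet kappa is low because most of that agreement is expected by chance when "ok" dominates both label sets.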

minor comments (1)

[Abstract] The abstract mentions 'depending on the evaluation metric' but does not specify which metrics are used or provide the exact values for each.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below and will incorporate revisions to strengthen the description of our translation quality control process.

Referee: [Abstract] The self-evaluation-based quality control for translation is described without quantitative metrics such as translation error rates, inter-annotator agreement, or validation by Indonesian radiologists. This omission is load-bearing because the central claim attributes the 8-25% performance gap to language shift; without external clinical validation, the gap could stem from translation artifacts in terminology or answer equivalence.

Authors: We agree that the abstract provides only a high-level description of the self-evaluation-based quality control without quantitative metrics. The full manuscript (Section 3.2) details that the process combined machine translation with LLM self-evaluation prompts to verify clinical meaning, terminology consistency, and answer equivalence, followed by author spot-checks. No inter-annotator agreement or radiologist validation was performed, as the workflow was primarily automated. In the revised manuscript we will expand the abstract to note the self-evaluation approach and add an explicit limitations paragraph acknowledging the lack of external clinical validation. We maintain that the 8-25% gaps are unlikely to be explained solely by translation artifacts, because the error analysis identifies consistent, language-specific failure modes (yes/no flips, laterality errors, output-language mismatches) across multiple models that align with known cross-lingual challenges rather than random translation noise.

Revision: yes
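The workflow described in this simulated rebuttal (machine translation, an LLM self-evaluation pass over three criteria, then spot-checks) could be organized roughly as below. This is a sketch under assumptions, not the authors' pipeline: `translate` and `judge` are hypothetical callables standing in for whatever translation and evaluation models are used, the retry limit is invented, and the toy stubs exist only to make the example runnable.

```python
from dataclasses import dataclass

# The three criteria named in the source text.
CRITERIA = ("clinical_meaning", "terminology_consistency", "answer_equivalence")


@dataclass
class QAPair:
    question: str
    answer: str


def translate_with_self_eval(pair, translate, judge, max_attempts=3):
    """Return (translated pair or None, needs_human_review flag).

    translate: str -> str (hypothetical machine-translation call)
    judge: (source QAPair, candidate QAPair) -> dict of criterion -> bool
    """
    for _ in range(max_attempts):
        candidate = QAPair(translate(pair.question), translate(pair.answer))
        verdict = judge(pair, candidate)
        if all(verdict.get(name, False) for name in CRITERIA):
            return candidate, False
    # Every attempt failed at least one criterion: route to a human.
    return None, True


# Toy stand-ins so the sketch runs; they are not real models.
TOY_DICT = {"Is there a fracture?": "Apakah ada fraktur?", "yes": "ya"}


def toy_translate(text):
    return TOY_DICT.get(text, text)


def toy_judge(source, candidate):
    changed = candidate.question != source.question
    return {name: changed for name in CRITERIA}


accepted, flagged = translate_with_self_eval(
    QAPair("Is there a fracture?", "yes"), toy_translate, toy_judge
)
rejected, needs_review = translate_with_self_eval(
    QAPair("Unknown question", "yes"), toy_translate, toy_judge
)
print(accepted, flagged)
print(rejected, needs_review)
```

Routing failed items to human review, rather than silently keeping the last attempt, is the design choice that would give the spot-check step a defined input and a measurable rejection rate.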

Circularity Check

0 steps flagged

No circularity: purely empirical dataset creation and model evaluation

The paper introduces IndoRad-VQA by translating VQA-RAD question-answer pairs with self-evaluation QC and directly measures performance gaps on off-the-shelf VLMs under English vs. Indonesian prompting. No equations, derivations, fitted parameters, or predictions that reduce to inputs by construction are present. The central claim rests on observed accuracy differences (8-25%) rather than any self-referential step. Self-citations, if any, are not load-bearing for the empirical comparison. The study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is an empirical dataset translation and benchmarking study; the only notable assumption is that the translation step maintains clinical fidelity.

axioms (1)

domain assumption: Self-evaluation-based quality control during translation preserves clinical meaning, terminology consistency, and answer equivalence.
Invoked in the abstract as the method used to create the Indonesian version of VQA-RAD.
