pith. machine review for the scientific record.

arxiv: 2605.13292 · v1 · submitted 2026-05-13 · 💻 cs.CL · cs.AI · cs.IR · cs.LG

Recognition: no theorem link

IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:00 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR · cs.LG
keywords medical dialogue dataset · multilingual medical AI · Indic languages · multi-turn conversation · parameter-efficient fine-tuning · synthetic data generation · clinical plausibility evaluation

The pith

IndicMedDialog supplies parallel multi-turn medical dialogues in English and nine Indic languages to support personalized symptom-elicitation models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a new dataset called IndicMedDialog that pairs English medical consultations with versions in Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu. The dialogues are created by extending an existing English resource with LLM-generated content, translated through a specialized model, checked by native speakers, and cleaned with a script-aware pipeline. The authors then fine-tune a small language model on this data so it can handle multi-turn conversations while optionally using patient background for personalization. They compare the fine-tuned model against zero-shot baselines, run error analysis across all ten languages, and have medical experts review the outputs for clinical plausibility.
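The paper does not publish its script-aware cleaning pipeline, but one plausible minimal form is a Unicode-canonicalization pass followed by a variant-to-canonical rewrite table. The sketch below is an assumption, not the authors' code; the one Bengali pair is taken from the paper's Figure 4 (erroneous and canonical spellings of "asthma"), and a real pipeline would hold such tables per language.

```python
import unicodedata

# Hypothetical variant table: erroneous surface forms -> canonical form.
# The Bengali pair is drawn from the paper's Figure 4 (asthma); all other
# entries in a real pipeline would be mined per language.
VARIANT_MAP = {
    "আসথমা": "আস্থমা",  # conjunct written without the joining hasanta
}

def normalize_term(text: str, variant_map: dict[str, str]) -> str:
    """Canonicalize Unicode (NFC), then rewrite known erroneous variants."""
    text = unicodedata.normalize("NFC", text)
    for bad, good in variant_map.items():
        text = text.replace(bad, good)
    return text
```

This treats spelling repair as pure string rewriting; it deliberately runs after NFC normalization so that visually identical but differently encoded sequences hit the same table entry.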

Core claim

IndicMedDialog is a parallel multi-turn medical dialogue dataset covering English and nine Indic languages; it is built from LLM-generated synthetic consultations translated with TranslateGemma, verified by native speakers, and post-processed for script accuracy, then used to fine-tune IndicMedLM via parameter-efficient adaptation so the model can conduct personalized, multi-turn symptom elicitation.

What carries the argument

IndicMedDialog dataset — a parallel corpus of multi-turn medical dialogues that supplies both training data and evaluation material across ten languages.

If this is right

  • A fine-tuned IndicMedLM can maintain coherent multi-turn symptom questioning while incorporating optional patient pre-context.
  • The same dataset enables direct comparison of model performance across English and nine Indic languages under identical medical scenarios.
  • Native-speaker verification plus script-aware post-processing reduces phonetic, lexical, and spacing errors that otherwise break dialogue flow.
  • Error analysis across ten languages reveals language-specific failure modes that future training runs can target.
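The per-language error analysis the last bullet anticipates reduces to a simple tally over annotated outputs. The sketch below is a hypothetical harness (the record format and error-type names are assumptions, not the paper's):

```python
from collections import Counter

def error_rates(annotations):
    """annotations: iterable of (language, error_type) pairs, where
    error_type is None for a clean output. Returns, per language, the
    fraction of outputs exhibiting each error type."""
    totals: Counter = Counter()
    errors: dict[str, Counter] = {}
    for lang, etype in annotations:
        totals[lang] += 1
        if etype is not None:
            errors.setdefault(lang, Counter())[etype] += 1
    return {
        lang: {e: n / totals[lang] for e, n in errs.items()}
        for lang, errs in errors.items()
    }
```

Keeping the denominator per language (rather than pooled) is what surfaces language-specific failure modes instead of averaging them away.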

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset could serve as a testbed for measuring how well small quantized models retain medical reasoning after cross-lingual transfer.
  • If the synthetic dialogues prove robust, similar pipelines might accelerate creation of domain-specific conversation data in other low-resource languages.
  • Real-world deployment would still require ongoing clinical oversight because the current expert review covers only a sample of outputs.

Load-bearing premise

LLM-generated synthetic consultations, once translated and checked by native speakers, produce clinically plausible dialogues that match real patient-provider exchanges without systematic bias or factual mistakes.

What would settle it

Medical experts reviewing a random sample of the generated dialogues flag repeated factual errors, missing clinical steps, or culturally inappropriate advice at a rate above 15 percent.
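Whether an observed flag rate genuinely clears the 15 percent bar depends on sample size; a hedged way to decide is to require the whole confidence interval, not just the point estimate, to sit above the threshold. A minimal sketch using a Wilson score interval (the threshold and counts are illustrative):

```python
import math

def wilson_interval(flagged: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion flagged/n."""
    p = flagged / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def exceeds_threshold(flagged: int, n: int, threshold: float = 0.15) -> bool:
    """True only when the lower interval bound clears the threshold."""
    lo, _ = wilson_interval(flagged, n)
    return lo > threshold
```

With, say, 12 flags in 100 dialogues the interval straddles 15 percent and the question stays open; a much higher flag rate settles it against the claim.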

Figures

Figures reproduced from arXiv: 2605.13292 by Piyush Patel, Shubham Kumar Nigam, Suparnojit Sarkar.

Figure 1. Response from a general-purpose LLM (ChatGPT). The model produces a single explanatory answer without follow-up questioning or symptom elicitation. view at source ↗
Figure 2. Example interaction with IndicMedLM. The system incorporates patient pre-context (age, gender, allergies) and conducts structured multi-turn symptom elicitation before producing a final diagnosis. view at source ↗
Figure 3. Overview of the IndicMedDialog framework. The MDDial dataset is augmented with synthetic dialogues, filtered through quality control, and translated into nine Indic languages to form a parallel corpus. Compact models are then fine-tuned using parameter-efficient methods to obtain IndicMedLM, which performs multi-turn diagnosis using an optional patient pre-context. view at source ↗
Figure 4. The range of phonetically and lexically incorrect Bengali variants generated for the medical term asthma (আস্থমা). The erroneous forms include incorrect vowel mappings, spurious character insertions, and erroneous spacing within conjunct consonants, all of which are mapped to the correct canonical form through our post-processing pipeline. view at source ↗
Figure 5. Examples of phonetically and lexically in… view at source ↗
read the original abstract

Most existing medical dialogue systems operate in a single-turn question--answering paradigm or rely on template-based datasets, limiting conversational realism and multilingual applicability. We introduce IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu. The dataset extends MDDial with LLM-generated synthetic consultations, translated using TranslateGemma, verified by native speakers, and refined through a script-aware post-processing pipeline to correct phonetic, lexical, and character-spacing errors. Building on this dataset, we fine-tune IndicMedLM via parameter-efficient adaptation of a quantized small language model, incorporating optional patient pre-context to personalise multi-turn symptom elicitation. We evaluate against zero-shot multilingual baselines, conduct systematic error analysis across ten languages, and validate clinical plausibility through medical expert evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces IndicMedDialog, a parallel multi-turn medical dialogue dataset covering English and nine Indic languages (Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, Urdu). It is constructed by extending MDDial via LLM-generated synthetic consultations, TranslateGemma translation, native-speaker verification, and script-aware post-processing. The authors then fine-tune IndicMedLM via parameter-efficient adaptation of a quantized small language model (with optional patient pre-context), evaluate against zero-shot multilingual baselines, perform error analysis across ten languages, and validate clinical plausibility via medical expert evaluation.

Significance. If the dataset construction pipeline produces clinically plausible dialogues without systematic factual errors or biases, and if the fine-tuned model shows meaningful gains over baselines, the work would provide a valuable resource for multilingual medical dialogue systems in low-resource Indic languages, addressing a clear gap in accessible healthcare AI.

major comments (2)
  1. [Abstract / Evaluation section] Abstract and evaluation description: the manuscript states that clinical plausibility is validated through medical expert evaluation and that systematic error analysis is conducted, yet supplies no quantitative results (e.g., inter-rater agreement, error rates, or comparison against real consultations). This information is load-bearing for the central claim that the synthetic data faithfully represents patient-provider interactions.
  2. [Abstract / Dataset construction] Dataset creation pipeline (described in abstract): the reliance on native-speaker verification to catch LLM hallucinations and factual errors in medical content is not supported by reported metrics or protocols showing that non-expert verifiers reliably detect clinical inaccuracies; this assumption underpins the entire dataset's utility for downstream fine-tuning.
minor comments (2)
  1. [Abstract] Clarify the base small language model used for IndicMedLM (name, size, quantization details) and the exact parameter-efficient adaptation method (e.g., LoRA rank, target modules).
  2. [Abstract] The post-processing pipeline for phonetic, lexical, and character-spacing errors should include concrete examples of corrections applied per language.
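The details minor comment 1 asks for would pin down a recipe of roughly the following shape. Everything here is an assumption, not the paper's reported setup: the base model name, 4-bit quantization scheme, LoRA rank, and target modules are illustrative placeholders in a standard Hugging Face peft/bitsandbytes configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# All values below are illustrative; the paper does not disclose its
# base model, quantization scheme, or adapter hyperparameters.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",  # hypothetical small base model
    quantization_config=bnb,
)
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # adapters train a small fraction of weights
```

Reporting exactly these fields (model, bit-width, rank, alpha, target modules) would make the IndicMedLM setup reproducible.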

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for strengthening the manuscript's claims about dataset quality and validation. We address each major comment below and will incorporate revisions to provide the requested quantitative details and protocol transparency.

read point-by-point responses
  1. Referee: [Abstract / Evaluation section] Abstract and evaluation description: the manuscript states that clinical plausibility is validated through medical expert evaluation and that systematic error analysis is conducted, yet supplies no quantitative results (e.g., inter-rater agreement, error rates, or comparison against real consultations). This information is load-bearing for the central claim that the synthetic data faithfully represents patient-provider interactions.

    Authors: We agree that quantitative results are necessary to substantiate the claims. In the revised manuscript, we will expand the Evaluation section with specific metrics from the medical expert evaluation, including inter-rater agreement (e.g., Cohen's kappa), categorized error rates (factual inaccuracies, hallucinations, clinical implausibility), and comparisons of dialogue statistics such as average turns, symptom coverage, and lexical diversity against publicly available real medical dialogue corpora. We will also detail the expert evaluation protocol, including the number of experts, their qualifications, and scoring rubrics. revision: yes

  2. Referee: [Abstract / Dataset construction] Dataset creation pipeline (described in abstract): the reliance on native-speaker verification to catch LLM hallucinations and factual errors in medical content is not supported by reported metrics or protocols showing that non-expert verifiers reliably detect clinical inaccuracies; this assumption underpins the entire dataset's utility for downstream fine-tuning.

    Authors: We acknowledge the importance of documenting the verification process rigorously. The revised version will include a new subsection on native-speaker verification that reports the full protocol: number of verifiers per language, their selection (native speakers, with preference for those having medical familiarity), provided guidelines for identifying hallucinations and errors, and quantitative outcomes such as inter-verifier agreement rates and the percentage of dialogues flagged for correction. We will also explain how this step is complemented by the medical expert review to address potential limitations of non-expert detection. revision: yes
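The inter-rater agreement both responses promise is cheap to compute once ratings exist; a self-contained sketch of Cohen's kappa for two raters (the example ratings are hypothetical, not from the paper):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label distribution.
    expected = sum((ca[l] / n) * (cb[l] / n) for l in set(ca) | set(cb))
    return (observed - expected) / (1 - expected)
```

For more than two verifiers per language, Krippendorff's alpha (which the paper's own reference [41] covers) generalizes this to missing data and many raters.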

Circularity Check

0 steps flagged

No circularity: dataset construction and fine-tuning rely on external verification steps

full rationale

The paper describes creation of IndicMedDialog by extending MDDial via LLM-generated consultations, TranslateGemma translation, native-speaker verification, and script post-processing, followed by parameter-efficient fine-tuning of IndicMedLM and evaluation against baselines plus expert plausibility checks. No equations, fitted parameters, or self-referential derivations appear; claims do not reduce to inputs by construction. External human verification and expert evaluation provide independent grounding, making the work self-contained without circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on two untested domain assumptions: that synthetic LLM dialogues can be made representative of real medical conversations through translation and verification, and that the translation model preserves clinical meaning across languages. No free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption LLM-generated synthetic consultations can be refined into clinically plausible multi-turn medical dialogues via translation and native-speaker verification
    Invoked to justify extending MDDial into the new IndicMedDialog dataset.
  • domain assumption TranslateGemma produces translations that preserve medical accuracy and conversational naturalness for the target Indic languages
    Central to the dataset creation pipeline described in the abstract.

pith-pipeline@v0.9.0 · 5466 in / 1422 out tokens · 40821 ms · 2026-05-14T19:00:32.441240+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 20 canonical work pages · 8 internal anchors

  1. [1] BiMediX: Bilingual Medical Mixture of Experts LLM. Findings of the Association for Computational Linguistics: EMNLP 2024.
  2. [2] MDDial: A Multi-Turn Differential Diagnosis Dialogue Dataset with Reliability Evaluation. arXiv:2308.08147.
  3. [3] DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue. arXiv:2505.19630.
  4. [4] MedDialog: Large-Scale Medical Dialogue Datasets. EMNLP 2020.
  5. [5] MedDG: An Entity-Centric Medical Consultation Dataset for Entity-Aware Medical Dialogue Generation. CCF International Conference on Natural Language Processing and Chinese Computing, 2022.
  6. [6] Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Models through Expert Feedback and Real-World Multi-Turn Dialogue. Proceedings of the AAAI Conference on Artificial Intelligence.
  7. [7] ChatCounselor: A Large Language Model for Mental Health Support. 2023.
  8. [8] NoteChat: A Dataset of Synthetic Patient-Physician Conversations Conditioned on Clinical Notes. Findings of the Association for Computational Linguistics: ACL 2024.
  9. [9] BianQue: Balancing the Questioning and Suggestion Ability of Health LLMs with Multi-Turn Health Conversations Polished by ChatGPT. arXiv:2310.15896.
  10. [10] MediTOD: An English Dialogue Dataset for Medical History Taking with Comprehensive Annotations. EMNLP 2024.
  11. [11] Towards Conversational Diagnostic AI. arXiv:2401.05654.
  12. [12] Chu, J., Sun, Y., Huang, H., Liu, Y. Med-Chat: Tuning ChatGLM3-6B with Chinese Medical Dialogue.
  13. [13] Hu, Z., Zhao, H., Zhao, Y., Xu, S., Xu, B. T-Agent: A Term-Aware Agent for Medical Dialogue Generation.
  14. [14] ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Cureus, 2023.
  15. [15] DailyDialog: A Manually Labelled Multi-Turn Dialogue Dataset. Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
  16. [16] MedDialogRubrics: A Comprehensive Benchmark and Evaluation Framework for Multi-Turn Medical Consultations in Large Language Models. arXiv:2601.03023.
  17. [17] ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue. arXiv:2409.17610.
  18. [18] Continuous Entity Reasoning for Multi-Turn Medical Dialogue Generation. IEEE Transactions on Consumer Electronics.
  19. [19] CDialog: A Multi-Turn COVID-19 Conversation Dataset for Entity-Aware Dialog Generation. EMNLP 2022.
  20. [20] Topic-Aware Multi-Turn Dialogue Modeling. Proceedings of the AAAI Conference on Artificial Intelligence.
  21. [21] MuTual: A Dataset for Multi-Turn Dialogue Reasoning. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  22. [22] Yi, Z., Ouyang, J., Xu, Z., Liu, Y., Liao, T., Luo, H., Shen, Y. ACM Computing Surveys, December 2025. doi:10.1145/3771090.
  23. [23] Improving Multi-Turn Dialogue Modelling with Utterance Rewriter. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
  24. [24] Advances in Multi-Turn Dialogue Comprehension: A Survey. arXiv:2103.03125.
  25. [25] Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-Turn Dialogue. arXiv:2402.17262.
  26. [26] WebLINX: Real-World Website Navigation with Multi-Turn Dialogue. arXiv:2402.05930.
  27. [27] MMDialog: A Large-Scale Multi-Turn Dialogue Dataset towards Multi-Modal Open-Domain Conversation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  28. [28] The Llama 3 Herd of Models. arXiv:2407.21783.
  29. [29] TranslateGemma Technical Report. arXiv:2601.09012.
  30. [30] Tiny Aya: Bridging Scale and Multilingual Depth. 2026.
  31. [31] LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.
  32. [32] 6G Non-Terrestrial Networks Enabled Low-Altitude Economy: Opportunities and Challenges. arXiv:2311.09047, 2023.
  33. [33] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. arXiv:2401.02954.
  34. [34] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. 2025.
  35. [35] Qwen3 Technical Report. 2025.
  36. [36] Qwen2.5-Coder Technical Report. arXiv:2409.12186.
  37. [37] Gemma: Open Models Based on Gemini Research and Technology. arXiv:2403.08295.
  38. [38] LoRA: Low-Rank Adaptation of Large Language Models. ICLR.
  39. [39] Decoupled Weight Decay Regularization. arXiv:1711.05101.
  40. [40] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300.
  41. [41] Computing Krippendorff's Alpha-Reliability. 2011.
  42. [42] Llama 3 Model Card.
  43. [43] Calibrate Before Use: Improving Few-Shot Performance of Language Models. International Conference on Machine Learning, 2021.
  44. [44] MedAidDialog: A Multilingual Multi-Turn Medical Dialogue Dataset for Accessible Healthcare. arXiv:2603.24132.