Implicit Geographic Inference in LLM Medical Triage: Language-Driven Disparities in Emergency Recommendations
Pith reviewed 2026-06-28 16:55 UTC · model grok-4.3
The pith
Large language models recommend emergency room visits for identical symptoms at rates from 0% to 30% depending only on the language of the query.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When the same symptom profile is presented in six languages, the model recommends an emergency-room visit at rates ranging from 0% (Japanese, Hindi) to 30% (English, Arabic) while giving almost identical severity scores (7.7–8.0/10). Explicitly anchoring the patient to a U.S. location raises the emergency-room rate for non-English prompts by up to 76.7 percentage points; anchoring an English prompt to Tokyo lowers the rate to 6.7%. A back-translation control produces rates comparable to the English baseline, indicating that the disparity arises from implicit geographic inference drawn from the input language rather than from translation artifacts or prompt formatting.
What carries the argument
Implicit geographic inference from the input language, which shifts triage output even when symptom descriptions and severity scores remain constant.
If this is right
- Non-English prompts receive lower emergency-room rates unless a U.S. location is stated.
- Adding a location sentence can change the emergency-room recommendation by more than 70 percentage points for some languages.
- A back-translated (Japanese-to-English) prompt restores the higher emergency-room rate, indicating that the effect follows the language of the prompt rather than translation quality.
- Severity scores stay nearly constant across languages, so the triage disparity is not explained by differences in perceived symptom seriousness.
Where Pith is reading between the lines
- The same language cue could alter other medical or safety decisions that depend on assumed location, such as recommended follow-up care or resource allocation.
- Testing the effect on additional models and symptom sets would show whether the pattern generalizes beyond the single model and single symptom profile examined here.
Load-bearing premise
The observed differences in emergency-room recommendations are produced by the model inferring the patient’s location from the language rather than by any other difference in how it processes the languages.
What would settle it
Running the identical symptom set in the six languages while always stating the same explicit location and finding that the emergency-room rates become statistically indistinguishable across languages.
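One way to check "statistically indistinguishable across languages" is a permutation test of homogeneity on the per-language emergency-room counts. The sketch below is not from the paper: the function names, the choice of a Pearson chi-square statistic, and the example counts are all assumptions made for illustration, and the counts are placeholders rather than reported results.

```python
import random


def chi_square_stat(er_counts, n_per_group):
    """Pearson chi-square statistic for a k x 2 table (ER vs. not ER)."""
    k = len(er_counts)
    pooled_rate = sum(er_counts) / (k * n_per_group)
    stat = 0.0
    for er in er_counts:
        cells = ((er, pooled_rate), (n_per_group - er, 1 - pooled_rate))
        for observed, rate in cells:
            expected = n_per_group * rate
            if expected > 0:  # skip an empty column to avoid dividing by zero
                stat += (observed - expected) ** 2 / expected
    return stat


def permutation_p_value(er_counts, n_per_group, n_perm=5000, seed=0):
    """Share of label shuffles whose statistic is at least the observed one."""
    rng = random.Random(seed)
    k = len(er_counts)
    observed = chi_square_stat(er_counts, n_per_group)
    n_er = sum(er_counts)
    labels = [1] * n_er + [0] * (k * n_per_group - n_er)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        shuffled = [
            sum(labels[i * n_per_group:(i + 1) * n_per_group]) for i in range(k)
        ]
        if chi_square_stat(shuffled, n_per_group) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p above zero


if __name__ == "__main__":
    # HYPOTHETICAL counts for six languages under one shared location anchor.
    placeholder_counts = [22, 24, 23, 21, 23, 22]
    print(permutation_p_value(placeholder_counts, n_per_group=30))
```

A large p-value under a shared anchor would be consistent with the load-bearing premise, although with 30 runs per language the test has limited power to detect small residual differences.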
Original abstract
We investigate whether large language models produce different medical triage recommendations for identical symptoms based solely on the language of the patient prompt. Using Gemini 3.5 Flash, we evaluate a neurological symptom profile (persistent headache, blurred vision, nausea) across six languages (English, Spanish, Chinese, Hindi, Japanese, Arabic) with 30 runs per condition (n=450 total API calls). We find that the model recommends emergency room visits at rates ranging from 0% (Japanese, Hindi) to 30% (English, Arabic), despite assigning nearly identical severity scores (7.7-8.0/10) across all languages. Adding a single sentence specifying the patient's US location increases ER recommendations by up to 76.7 percentage points for non-English prompts, while the reverse anchor (English prompt with a Tokyo location) reduces the ER rate from 30% to 6.7%. A back-translation control (Japanese to English) produces ER rates comparable to the English baseline, confirming that the disparity is not caused by translation quality but by implicit geographic inference from the input language. We release the complete dataset, experiment code, and results.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports an empirical evaluation of Gemini 3.5 Flash on identical neurological symptoms (persistent headache, blurred vision, nausea) presented in six languages (English, Spanish, Chinese, Hindi, Japanese, Arabic). With 30 runs per condition, ER-visit recommendations range from 0% (Japanese, Hindi) to 30% (English, Arabic) despite nearly identical severity scores (7.7–8.0/10). Location-anchor sentences and a back-translation control are used to argue that the disparity arises from implicit geographic inference triggered by input language rather than translation artifacts or non-geographic language effects.
Significance. If the central pattern holds, the work identifies a concrete mechanism by which language can serve as a proxy for geography in LLM medical triage, with direct relevance to fairness and safety in multilingual healthcare AI. The experimental controls (US/Tokyo anchors shifting rates by up to 76.7 pp; back-translation restoring English-like rates) and the public release of the full dataset, code, and results are strengths that support reproducibility and further scrutiny.
major comments (2)
- [Results] Results (abstract and main text): ER recommendation rates are reported as raw percentages (0%–30%) with no accompanying statistical tests, confidence intervals, or standard errors. With n=30 per language, binomial proportion tests or Fisher exact tests comparing languages would be required to establish that the observed differences are unlikely under a null of no language effect.
- [Methods] Methods: The precise criteria used to classify model outputs as 'ER recommendation' versus other advice, and the exact prompt templates (including any system instructions or formatting differences across languages), are not described at a level that permits independent verification of response coding.
minor comments (1)
- The back-translation control is described only at a high level; an explicit statement of the translation model or service used and the exact Japanese-to-English prompt would increase transparency.
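The statistical additions requested in the first major comment can be sketched with the standard library alone. The example assumes that 30% and 0% of 30 runs correspond to 9 and 0 emergency-room recommendations; the function names and the choice of Wilson intervals are illustrative assumptions, not the authors' analysis.

```python
from math import comb, sqrt


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table that is no more
    likely than the observed one, with the margins held fixed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )


if __name__ == "__main__":
    # Assumed counts: 9 of 30 (30%) for English, 0 of 30 (0%) for Japanese.
    print("English 95% CI:", wilson_interval(9, 30))
    print("Japanese 95% CI:", wilson_interval(0, 30))
    print("Fisher p:", fisher_exact_two_sided(9, 21, 0, 30))
```

Even a 0% observed rate leaves a non-trivial upper bound at n=30, which is why intervals matter here as much as the pairwise tests. Pairwise testing across six languages would also need a multiple-comparison correction.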
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight opportunities to strengthen the statistical rigor and methodological transparency of the manuscript. We address each major comment below and will incorporate revisions in the next version.
- Referee: [Results] Results (abstract and main text): ER recommendation rates are reported as raw percentages (0%–30%) with no accompanying statistical tests, confidence intervals, or standard errors. With n=30 per language, binomial proportion tests or Fisher exact tests comparing languages would be required to establish that the observed differences are unlikely under a null of no language effect.
  Authors: We agree that formal statistical tests are required to support the reported differences. In the revised manuscript we will add pairwise Fisher's exact tests (or binomial proportion tests) across languages, report 95% confidence intervals for each proportion, and include standard errors. These additions will be placed in both the abstract and the Results section alongside the existing percentages.
  Revision: yes
- Referee: [Methods] Methods: The precise criteria used to classify model outputs as 'ER recommendation' versus other advice, and the exact prompt templates (including any system instructions or formatting differences across languages), are not described at a level that permits independent verification of response coding.
  Authors: We acknowledge that the current Methods section does not provide sufficient detail for independent replication of the coding process. Although the full prompt templates, system instructions, and classification code are already released in the public repository, we will expand the Methods section to explicitly state the classification criteria (keyword and semantic rules used to label an output as an ER recommendation) and reproduce the exact prompt templates for each language.
  Revision: yes
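To make concrete what a keyword rule for response coding might look like, here is a minimal sketch. The phrase list, function names, and English-only matching are hypothetical and are not taken from the paper's repository, which may use different and more careful rules.

```python
import re

# HYPOTHETICAL phrase list for illustration; not the paper's actual criteria.
ER_PATTERNS = [
    r"\bemergency (room|department)\b",
    r"\bgo to the er\b",
    r"\bcall 911\b",
]


def is_er_recommendation(response: str) -> bool:
    """Label a response as an ER recommendation if any pattern matches."""
    text = response.lower()
    return any(re.search(pattern, text) for pattern in ER_PATTERNS)


def er_rate(responses):
    """Fraction of responses labelled as ER recommendations."""
    labels = [is_er_recommendation(r) for r in responses]
    return sum(labels) / len(labels)


if __name__ == "__main__":
    sample = [
        "Please go to the emergency room right away.",
        "Schedule an appointment with your doctor this week.",
    ]
    print(er_rate(sample))
```

A pure keyword rule like this mislabels negated advice (for example, "no need to visit the emergency room") and does not cover non-English responses, which is presumably why the authors also mention semantic rules. It illustrates why the referee asks for the criteria to be spelled out.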
Circularity Check
Empirical evaluation with no circular derivation
The paper is a controlled empirical study that runs API calls on Gemini 3.5 Flash for identical symptoms in six languages, measures ER recommendation rates and severity scores, then applies explicit interventions (US/Tokyo location anchors and back-translation). No equations, parameter fitting, predictions derived from fitted inputs, or self-citation load-bearing steps appear in the reported design or results. The central pattern (language-linked ER disparity despite matched severity, shifted by location anchors) is measured directly from the API outputs and controls, with no reduction to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The symptom descriptions are semantically equivalent across languages after translation.
- domain assumption: API responses are consistent and not affected by unaccounted model updates or randomness beyond the 30 runs.