Implicit Geographic Inference in LLM Medical Triage: Language-Driven Disparities in Emergency Recommendations

Qi Han Wong

arxiv: 2606.01204 · v1 · pith:5K3XQQG5new · submitted 2026-05-31 · cs.CL · cs.AI · cs.CY

Pith reviewed 2026-06-28 16:55 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.CY

keywords LLM medical triage · language disparities · geographic inference · emergency recommendations · implicit bias · multilingual prompting · triage consistency

The pith

Large language models recommend emergency room visits for identical symptoms at rates from 0% to 30% depending only on the language of the query.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether an LLM produces different medical triage advice for the same neurological symptoms when the prompt is written in English, Spanish, Chinese, Hindi, Japanese or Arabic. It finds that emergency-room recommendations vary sharply by language even though the model assigns nearly the same severity score in every case. Adding one sentence that states the patient is in the United States raises the emergency-room rate for non-English prompts by as much as 76.7 percentage points; adding a Tokyo location to an English prompt lowers it from 30% to 6.7%. A back-translation check shows the effect is not produced by translation quality. The central claim is therefore that the model draws an implicit geographic inference from the language alone and uses that inference to adjust its triage decision.

Core claim

When the same symptom profile is presented in six languages, the model recommends an emergency-room visit at rates ranging from 0% (Japanese, Hindi) to 30% (English, Arabic) while giving almost identical severity scores (7.7–8.0/10). Explicitly anchoring the patient to a U.S. location raises the emergency-room rate for non-English prompts by up to 76.7 percentage points; anchoring an English prompt to Tokyo lowers the rate to 6.7%. A back-translation control produces rates comparable to the English baseline, indicating that the disparity arises from implicit geographic inference drawn from the input language rather than from translation artifacts or prompt formatting.

What carries the argument

Implicit geographic inference from the input language, which shifts triage output even when symptom descriptions and severity scores remain constant.

If this is right

Non-English prompts receive lower emergency-room rates unless a U.S. location is stated.
Adding a location sentence can change the emergency-room recommendation by more than 70 percentage points for some languages.
Back-translation to English restores the higher emergency-room rate, confirming the effect is tied to the original language cue.
Severity scores stay nearly constant across languages, so the triage disparity is not explained by differences in perceived symptom seriousness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same language cue could alter other medical or safety decisions that depend on assumed location, such as recommended follow-up care or resource allocation.
Testing the effect on additional models and symptom sets would show whether the pattern generalizes beyond the single model and single symptom profile examined here.

Load-bearing premise

The observed differences in emergency-room recommendations are produced by the model inferring the patient’s location from the language rather than by any other difference in how it processes the languages.

What would settle it

Running the identical symptom set in the six languages while always stating the same explicit location and finding that the emergency-room rates become statistically indistinguishable across languages.
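One way to make "statistically indistinguishable" concrete is a permutation test of homogeneity across languages. The sketch below is an illustration, not the paper's analysis. The baseline counts are derived from the reported 30% and 0% rates at 30 runs per condition (Spanish and Chinese are omitted because their rates are not quoted here). The anchored counts are hypothetical placeholders, and the function names `chi2_stat` and `permutation_p` are invented for this example.

```python
import random

def chi2_stat(outcomes, n_groups, n_per_group):
    """Pearson chi-square statistic for a 2 x k table with equal group sizes."""
    overall = sum(outcomes) / len(outcomes)
    if overall in (0.0, 1.0):
        return 0.0
    expected = n_per_group * overall
    stat = 0.0
    for g in range(n_groups):
        observed = sum(outcomes[g * n_per_group:(g + 1) * n_per_group])
        stat += (observed - expected) ** 2 / (expected * (1 - overall))
    return stat

def permutation_p(er_counts, n_per_group=30, n_perm=2000, seed=0):
    """Share of label shuffles whose statistic is at least the observed one."""
    rng = random.Random(seed)
    k = len(er_counts)
    # 1 = ER recommended, 0 = other advice; groups are laid out contiguously.
    outcomes = [1 if i < c else 0 for c in er_counts for i in range(n_per_group)]
    observed = chi2_stat(outcomes, k, n_per_group)
    shuffled = outcomes[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if chi2_stat(shuffled, k, n_per_group) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Derived from the reported rates: English 9/30, Arabic 9/30, Japanese 0/30, Hindi 0/30.
baseline_counts = [9, 9, 0, 0]
# HYPOTHETICAL counts for six languages that all state the same explicit location.
anchored_counts = [25, 24, 26, 23, 24, 25]

p_baseline = permutation_p(baseline_counts)
p_anchored = permutation_p(anchored_counts)
print(f"baseline p = {p_baseline:.4f}, anchored (hypothetical) p = {p_anchored:.4f}")
```

A small p-value for the baseline counts together with a large one for the anchored counts would be the pattern the settling experiment looks for. The anchored numbers here are invented, so the second p-value only demonstrates the mechanics.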

Original abstract

We investigate whether large language models produce different medical triage recommendations for identical symptoms based solely on the language of the patient prompt. Using Gemini 3.5 Flash, we evaluate a neurological symptom profile (persistent headache, blurred vision, nausea) across six languages (English, Spanish, Chinese, Hindi, Japanese, Arabic) with 30 runs per condition (n=450 total API calls). We find that the model recommends emergency room visits at rates ranging from 0% (Japanese, Hindi) to 30% (English, Arabic), despite assigning nearly identical severity scores (7.7-8.0/10) across all languages. Adding a single sentence specifying the patient's US location increases ER recommendations by up to 76.7 percentage points for non-English prompts, while the reverse anchor (English prompt with a Tokyo location) reduces the ER rate from 30% to 6.7%. A back-translation control (Japanese to English) produces ER rates comparable to the English baseline, confirming that the disparity is not caused by translation quality but by implicit geographic inference from the input language. We release the complete dataset, experiment code, and results.
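The percentages in the abstract can be turned back into run counts, which makes the granularity of the results easier to see. The sketch below assumes only the two figures the abstract states (30 runs per condition, 450 total calls). The helper name `implied_count` is made up for this example, and the counts are inferred by rounding rather than reported by the author.

```python
RUNS_PER_CONDITION = 30   # stated in the abstract
TOTAL_CALLS = 450         # stated in the abstract

def implied_count(rate_percent: float, n: int = RUNS_PER_CONDITION) -> int:
    """Nearest whole number of runs consistent with a reported percentage."""
    return round(rate_percent / 100 * n)

reported = {
    "English baseline ER rate": 30.0,
    "English + Tokyo anchor ER rate": 6.7,
    "Largest US-anchor increase (pp)": 76.7,
}
implied = {label: implied_count(pct) for label, pct in reported.items()}

# Number of experimental conditions implied by the call budget.
conditions = TOTAL_CALLS // RUNS_PER_CONDITION
# Size of one run expressed in percentage points.
step_pp = 100 / RUNS_PER_CONDITION

for label, k in implied.items():
    print(f"{label}: {k}/{RUNS_PER_CONDITION}")
print(f"conditions implied by call count: {conditions}")
print(f"one run = {step_pp:.2f} percentage points")
```

Under these assumptions a single run moves a rate by about 3.33 percentage points, so 30% corresponds to 9 of 30 runs, 6.7% to 2 of 30, and a 76.7-point shift to a difference of 23 runs.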

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Gemini 3.5 Flash's emergency-room recommendation rate for identical symptoms swings from 0% to 30% with prompt language, apparently because the model infers location from the language; the location-anchor and back-translation controls support that reading.

The paper's core finding is that identical neurological symptoms produce ER recommendations ranging from 0% (Japanese, Hindi) to 30% (English, Arabic) in Gemini 3.5 Flash, even though severity scores stay nearly the same. Explicit US-location sentences raise non-English ER rates by up to 76.7 percentage points, while a Tokyo anchor drops the English rate to 6.7%. Back-translation to English restores the higher rate, so the effect tracks language as a proxy for location rather than translation artifacts.

The work is straightforward and the controls are well-targeted. Releasing the full dataset, code, and results is useful. The design directly tests the main alternative explanations, and the pattern is consistent with the stated mechanism.

The main limitation is narrow scope: one model, one symptom profile, and 30 runs per cell. We do not yet know whether the same language-location shortcut appears in other models or for different medical presentations. No statistical tests or variance estimates are visible in the abstract, though the raw data release lets others check that.
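To give a sense of the uncertainty at 30 runs per cell, here is a minimal sketch of a 95% Wilson score interval for the two extreme baseline rates. The counts 0/30 and 9/30 are inferred from the reported 0% and 30% rather than taken from the released data, and `wilson_interval` is an illustrative helper, not code from the paper's repository.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clip to [0, 1] to absorb floating-point error at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)

# Counts inferred from the reported rates at n = 30 runs per condition.
intervals = {
    "Japanese / Hindi (0/30)": wilson_interval(0, 30),
    "English / Arabic (9/30)": wilson_interval(9, 30),
}
for label, (low, high) in intervals.items():
    print(f"{label}: {low:.3f} to {high:.3f}")
```

With these inferred counts the interval for 0/30 tops out near 0.11 and the interval for 9/30 starts near 0.17, so the two do not overlap. That is consistent with a real difference but is not a substitute for the formal tests the referee asks for.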

This is the kind of targeted empirical result that belongs in the fairness and healthcare-AI literature. A serious referee should see it; the evidence on its own terms is clean enough to review even if later work broadens the conditions.

Referee Report

2 major / 1 minor

Summary. The manuscript reports an empirical evaluation of Gemini 3.5 Flash on identical neurological symptoms (persistent headache, blurred vision, nausea) presented in six languages (English, Spanish, Chinese, Hindi, Japanese, Arabic). With 30 runs per condition, ER-visit recommendations range from 0% (Japanese, Hindi) to 30% (English, Arabic) despite nearly identical severity scores (7.7–8.0/10). Location-anchor sentences and a back-translation control are used to argue that the disparity arises from implicit geographic inference triggered by input language rather than translation artifacts or non-geographic language effects.

Significance. If the central pattern holds, the work identifies a concrete mechanism by which language can serve as a proxy for geography in LLM medical triage, with direct relevance to fairness and safety in multilingual healthcare AI. The experimental controls (US/Tokyo anchors shifting rates by up to 76.7 pp; back-translation restoring English-like rates) and the public release of the full dataset, code, and results are strengths that support reproducibility and further scrutiny.

major comments (2)

[Results] Results (abstract and main text): ER recommendation rates are reported as raw percentages (0%–30%) with no accompanying statistical tests, confidence intervals, or standard errors. With n=30 per language, binomial proportion tests or Fisher exact tests comparing languages would be required to establish that the observed differences are unlikely under a null of no language effect.
[Methods] Methods: The precise criteria used to classify model outputs as 'ER recommendation' versus other advice, and the exact prompt templates (including any system instructions or formatting differences across languages), are not described at a level that permits independent verification of response coding.
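The test requested in the first major comment can be sketched with the standard library alone. The example below runs a two-sided Fisher's exact test on English (9 of 30 ER recommendations) against Japanese (0 of 30). Those counts are inferred from the reported 30% and 0% rates, not read from the released data, and `fisher_exact_two_sided` is a hand-rolled illustration rather than the author's code.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are groups; columns are (ER recommended, other advice).
    Sums the probabilities of all tables no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x ER recommendations in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)

# English 9/30 vs Japanese 0/30, inferred from the reported rates.
p_value = fisher_exact_two_sided(9, 21, 0, 30)
print(f"two-sided p = {p_value:.4f}")
```

A full analysis would repeat this for every language pair and correct for multiple comparisons, which this sketch does not attempt.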

minor comments (1)

The back-translation control is described only at a high level; an explicit statement of the translation model or service used and the exact Japanese-to-English prompt would increase transparency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the statistical rigor and methodological transparency of the manuscript. We address each major comment below and will incorporate revisions in the next version.

Referee: [Results] Results (abstract and main text): ER recommendation rates are reported as raw percentages (0%–30%) with no accompanying statistical tests, confidence intervals, or standard errors. With n=30 per language, binomial proportion tests or Fisher exact tests comparing languages would be required to establish that the observed differences are unlikely under a null of no language effect.

Authors: We agree that formal statistical tests are required to support the reported differences. In the revised manuscript we will add pairwise Fisher's exact tests (or binomial proportion tests) across languages, report 95% confidence intervals for each proportion, and include standard errors. These additions will be placed in both the abstract and the Results section alongside the existing percentages.
Revision: yes

Referee: [Methods] Methods: The precise criteria used to classify model outputs as 'ER recommendation' versus other advice, and the exact prompt templates (including any system instructions or formatting differences across languages), are not described at a level that permits independent verification of response coding.

Authors: We acknowledge that the current Methods section does not provide sufficient detail for independent replication of the coding process. Although the full prompt templates, system instructions, and classification code are already released in the public repository, we will expand the Methods section to explicitly state the classification criteria (keyword and semantic rules used to label an output as an ER recommendation) and reproduce the exact prompt templates for each language.
Revision: yes
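For readers wondering what a keyword rule for response coding might look like, here is a deliberately small sketch. The cue lists, the negation handling, and the function name `recommends_er` are all hypothetical. The paper's actual criteria live in its repository and are not reproduced here, and this version only handles English responses.

```python
import re

# HYPOTHETICAL cue lists; the released classification code may differ entirely.
ER_CUES = re.compile(
    r"\b(emergency room|emergency department|go to the er|call 911)\b",
    re.IGNORECASE,
)
NEGATION_CUES = re.compile(
    r"\b(no need|not need|don't need|not necessary)\b",
    re.IGNORECASE,
)

def recommends_er(response: str) -> bool:
    """Label a response as an ER recommendation.

    A response counts if any sentence mentions an ER cue and that same
    sentence contains no negation cue.
    """
    for sentence in re.split(r"(?<=[.!?])\s+", response):
        if ER_CUES.search(sentence) and not NEGATION_CUES.search(sentence):
            return True
    return False

examples = [
    "Your symptoms are concerning. Please go to the emergency room now.",
    "You do not need the emergency room; see your doctor within a few days.",
    "Schedule a visit with a primary care physician this week.",
]
labels = [recommends_er(text) for text in examples]
print(labels)
```

Rules this simple will miscode hedged or conditional advice, which is one reason the referee's request for explicit, published criteria matters for a six-language study.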

Circularity Check

0 steps flagged

Empirical evaluation with no circular derivation

The paper is a controlled empirical study that runs API calls on Gemini 3.5 Flash for identical symptoms in six languages, measures ER recommendation rates and severity scores, then applies explicit interventions (US/Tokyo location anchors and back-translation). No equations, parameter fitting, predictions derived from fitted inputs, or self-citation load-bearing steps appear in the reported design or results. The central pattern (language-linked ER disparity despite matched severity, shifted by location anchors) is measured directly from the API outputs and controls, with no reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim rests on the validity of the experimental controls and the assumption that language triggers geographic inference in the model.

axioms (2)

Domain assumption: The symptom descriptions are semantically equivalent across languages after translation. The experiment relies on this to attribute differences to language inference rather than content differences.
Domain assumption: API responses are consistent and not affected by unaccounted model updates or randomness beyond the 30 runs. This is a standard assumption in LLM evaluation studies.

pith-pipeline@v0.9.1-grok · 5734 in / 1395 out tokens · 38128 ms · 2026-06-28T16:55:01.920057+00:00 · methodology

