arxiv: 2606.05187 · v1 · pith:TNCLAGLTnew · submitted 2026-04-28 · 💻 cs.CY · cs.AI

Geographic Bias and Diversity in AI Evaluation

Pith reviewed 2026-07-01 08:42 UTC · model grok-4.3

classification 💻 cs.CY cs.AI
keywords geographic bias · AI evaluation · generative AI · diversity · representation bias · defaults · factual recall · language models

The pith

Generative AI disproportionately favors a narrow set of prototypical places, called defaults.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This literature review surveys geographic bias in AI before and after the rise of generative models. It catalogs biases such as uneven representation in training data, uneven factual recall by region in language models, and generative outputs that default to a narrow set of prototypical places. The paper then describes recent evaluation work that measures geographic diversity by changing cognitive task levels, model parameters, and output formats. A reader would care because these patterns could distort downstream uses in areas like biodiversity tracking and disaster response.

Core claim

The authors state that geographic biases in AI include representation bias in training data, regional disparities in factual recall, and the tendency of generative AI to over-proportionally favor prototypical places (called defaults). They show that recent studies address the last of these biases by evaluating geographic diversity across cognitive levels, parameter settings, and output modalities.

What carries the argument

The notion of defaults—prototypical places that generative models select disproportionately—together with evaluation methods that vary cognitive levels, parameter settings, and output modalities to test for geographic diversity.

If this is right

  • AI systems used for biodiversity or disaster mitigation may systematically under-represent or distort non-default locations.
  • Benchmarks for geographic unbiasedness must incorporate tests across multiple cognitive levels and output modalities.
  • Training data imbalances directly contribute to factual recall gaps and default favoritism in model outputs.
  • Parameter changes and modality shifts can be used as levers to increase measured geographic diversity.
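
As a hedged illustration of the last point, the sketch below shows why sampling temperature is a plausible lever. It assumes a toy next-token distribution over four candidate places with invented logits (they come from neither the paper nor any real model), applies temperature scaling to a softmax, and reports the order-1 diversity, i.e. the exponential of Shannon entropy. The function names `softmax_with_temperature` and `order1_diversity` are illustrative only.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def order1_diversity(probs):
    """Order-1 diversity: exp of Shannon entropy (effective number of categories)."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return math.exp(entropy)

# Hypothetical logits for four candidate places; the values are invented.
toy_logits = [4.0, 2.0, 1.0, 0.5]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: order-1 diversity = {order1_diversity(probs):.2f}")
```

In this toy setting a higher temperature flattens the distribution, so the effective number of places grows. Whether deployed models behave the same way across prompts is an empirical question, which is what the reviewed studies measure.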

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could run controlled prompts on underrepresented regions to measure the strength of default bias in specific models.
  • Audits for AI deployed in global decision systems might require explicit geographic coverage metrics.
  • Altering the spatial distribution of training examples could reduce default favoritism without changing model architecture.
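
A minimal sketch of the first point above, assuming place names have already been extracted from repeated sessions of a single prompt. The sample responses and the helper `default_share` are hypothetical and serve only to show one simple way to quantify how strongly a model falls back on its most frequent answers.

```python
from collections import Counter

def default_share(place_names, top_k=1):
    """Fraction of all responses taken up by the top_k most frequent places."""
    if not place_names:
        raise ValueError("place_names must not be empty")
    counts = Counter(place_names)
    top = counts.most_common(top_k)
    return sum(count for _, count in top) / len(place_names)

# Invented responses to a hypothetical prompt such as "Name a bridge."
responses = ["Golden Gate Bridge"] * 7 + ["Tower Bridge"] * 2 + ["Charles Bridge"]

share = default_share(responses, top_k=1)
print(f"Top-1 share: {share:.0%}")
```

A share close to 1 means nearly every session returned the same place. Comparing this value across prompts that target different regions gives a rough, model-specific indication of default strength.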

Load-bearing premise

The body of literature reviewed provides a comprehensive and representative picture of geographic bias issues across pre-generative and generative AI periods.

What would settle it

An empirical test in which generative models, prompted across many regions and modalities with controlled parameters, produce outputs whose geographic distribution is statistically indistinguishable from a real-world reference distribution, such as population or feature counts. Such a result would count against the default-favoritism claim.
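
One way to make such a test concrete is a goodness-of-fit check of output counts against a reference distribution. The sketch below is an illustration under stated assumptions, not a procedure taken from the paper: the region labels, reference shares, and model counts are invented, and the p-value is estimated by simulating samples from the reference rather than by a closed-form chi-square distribution. The helper names are hypothetical.

```python
import random
from collections import Counter

def chi_square_stat(observed, expected_probs):
    """Pearson chi-square statistic for observed counts vs. expected probabilities.

    Assumes every observed region also appears in expected_probs.
    """
    n = sum(observed.values())
    return sum(
        (observed.get(region, 0) - n * p) ** 2 / (n * p)
        for region, p in expected_probs.items()
    )

def monte_carlo_p_value(observed, expected_probs, n_sim=2000, seed=0):
    """Share of samples drawn from the reference whose statistic is >= the observed one."""
    rng = random.Random(seed)
    regions = list(expected_probs)
    weights = [expected_probs[r] for r in regions]
    n = sum(observed.values())
    stat = chi_square_stat(observed, expected_probs)
    hits = 0
    for _ in range(n_sim):
        sample = Counter(rng.choices(regions, weights=weights, k=n))
        if chi_square_stat(sample, expected_probs) >= stat:
            hits += 1
    return (hits + 1) / (n_sim + 1)  # add-one correction avoids reporting p = 0

# Invented numbers: reference shares per region and counts of model outputs.
reference = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
model_counts = {"A": 85, "B": 10, "C": 4, "D": 1}

print(monte_carlo_p_value(model_counts, reference))
```

A small p-value indicates that the outputs deviate from the reference more than sampling noise would explain. Choosing the reference itself (population, area, number of features) is a substantive decision that this sketch leaves open.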

Figures

Figures reproduced from arXiv: 2606.05187 by Gengchen Mai, Krzysztof Janowicz, Rui Zhu, Song Gao, Zilong Liu.

Figure 1: Illustration of representation bias with an example of four regions. The development sample includes Regions 1–3 for training the model. The model is then deployed to Region 4, which is the use population not represented in training. Before, Google Brain brought the lack of geographic diversity to attention in their examination of two of the most widely used benchmark datasets in computer vision [36]: Open…
Figure 2 illustrates these two dimensions of MAUP (the modifiable areal unit problem). In terms of the scale effect, aggregating all crime observations into two horizontal units smaller than the one large horizontal unit may lead to a different statistical conclusion about regional safety. With respect to the shape effect, repartitioning the same area into two vertical units produces yet another conclusion. [Panel title: Large-Scale Horizontal Aggregatio…]
Figure 3: Illustration of how a protected attribute (e.g., race) can influence outcomes in risk-assessment tools such as COMPAS. The left branch (in red) represents the case of Brisha Borden, who received a risk score of 8 despite not reoffending, while the right branch (in green) corresponds to Vernon Prater, who received a risk score of 3 but later reoffended. The two cases originate from the ProPublica story [4].…
Figure 4: A geoparsing pipeline showing how “Rome” is extracted and resolved. The final choice reflects an algorithmic bias towards more populous cities. For both tasks, language models, whether neural or not, struggle to perform equally well across different places [19] (footnote 6). However, this phenomenon was not attributed… Footnote 6: In fact, there is now a growing body of generative AI research on this problem, expanding the resear…
Figure 5: A knowledge-probing pipeline where an LLM predicts a masked token (e.g., “France”) from a cloze sentence. In such a context, geographic bias is regarded as systematic geographical disparities in LLM factual recall [25,23]. Using World Bank data, it has been discovered that the error rates in the factual recall of 20 LLMs were 1.5 times higher for Sub-Saharan African countries than for North American countr…
Figure 6: Experiments in which a user prompts a generative AI model in multiple sessions, producing uneven distributions of place name outputs. Section 4.1, Measurement of Geographic Diversity: Diversity is a natural indicator for quantifying this new geographic bias. This is because it has long been used to quantify similar phenomena on the richness and evenness of species in ecological studies, forming a natural analogy wi…
Figure 7: Diversity profiles for distributions with and without considering similarity. This plot also shows that the diversity of an even distribution remains constant across orders q, while the diversity of an uneven distribution declines as q increases. Section 4.2, Findings about (a Lack of) Geographic Diversity: The measurement of geographic diversity yields many interesting findings about the outputs of popular generat…
Figure 8: The average order-1 geographic diversity versus sampling temperature across three studied models from the work of [21].
Figure 9: Bias as a systematic deviation (with respect to the mean in this example) of the AI output from a statistical reference distribution. Section 5, Conclusions: The introduction of geographic diversity highlights that diversity should not only be articulated but also be made spatially explicit and measurable for the evaluation of (generative) AI systems. This leads to two subsequent research questions. First, what sho…
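
The captions for Figures 7 and 8 refer to diversity of order q. In the ecological literature the paper cites (Hill [12], Jost [14]), this is commonly the Hill number D_q = (sum_i p_i^q)^(1/(1-q)), whose limit as q approaches 1 is the exponential of Shannon entropy. Assuming that is the measure meant here, the sketch below computes it for two invented distributions to show the behaviour the Figure 7 caption describes. It does not cover the similarity-sensitive variant, and `hill_number` is an illustrative name.

```python
import math

def hill_number(probs, q):
    """Hill number (effective number of categories) of order q."""
    probs = [p for p in probs if p > 0]
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1: exponential of Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in probs))
    return sum(p ** q for p in probs) ** (1.0 / (1.0 - q))

even = [0.25, 0.25, 0.25, 0.25]    # invented, perfectly even
uneven = [0.70, 0.15, 0.10, 0.05]  # invented, dominated by one "default"

for q in (0, 1, 2):
    print(q, round(hill_number(even, q), 3), round(hill_number(uneven, q), 3))
```

For the even distribution the value stays at the number of categories for every q, while for the uneven one it drops as q grows, because higher orders give more weight to the dominant place.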
Original abstract

Among the many challenges hindering the responsible development and deployment of AI, arguably none has faced more intense scrutiny than bias in its various forms. This underscores the widespread concerns across AI researchers that model outputs, e.g., from generative AI, may encode structural distributional imbalances (stemming from training data or model design) that may amplify social inequality or introduce systemic distortions across application domains ranging from biodiversity to disaster mitigation. Yet, relatively little work has investigated the geographical nature of bias or developed measurable benchmarks for what it means for (generative) AI to be unbiased. In this chapter, we investigate this issue through a literature review. As foundation models are reshaping the landscape of bias research, we examine work spanning both the pre-generative AI and generative AI periods. First, we identify a range of geographic biases. These biases span from representation bias in the training data and regional disparities in the factual recall of language models to the tendency of generative AI to over-proportionally favor prototypical places (called defaults). Then, we showcase how recent studies address the latter bias by evaluating geographic diversity in the outputs of generative AI across various cognitive levels, parameter settings, and output modalities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript is a literature review on geographic bias in AI, spanning pre-generative and generative periods. It identifies biases including representation bias in training data, regional disparities in factual recall by language models, and generative AI's tendency to over-favor prototypical places (termed 'defaults'). It further claims that recent studies evaluate geographic diversity in generative AI outputs across cognitive levels, parameter settings, and output modalities.

Significance. If the reviewed literature is comprehensive and representative, the work would usefully synthesize an under-examined dimension of AI bias and could inform benchmark development for geographic fairness. The absence of any methodological details, however, prevents assessment of whether the synthesis accurately reflects the state of the field.

major comments (2)
  1. [Abstract] The description of the literature review provides no details on search methodology, databases, keywords, inclusion/exclusion criteria, or time spans covered. This is load-bearing for the central claims, which rest entirely on the representativeness of the selected studies (as the skeptic note correctly identifies).
  2. [Abstract] The claim that 'generative AI tends to over-proportionally favor prototypical places' and that 'recent studies evaluate geographic diversity across cognitive levels, parameter settings, and output modalities' is presented as a synthesis result, yet no quantitative information (number of studies per category, explicit selection criteria) is supplied to allow verification or to rule out systematic omission of counter-examples or non-English work.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on our literature review. The comments correctly identify a lack of methodological transparency that weakens the manuscript's claims about representativeness. We will revise to address this.

  1. Referee: [Abstract] The description of the literature review provides no details on search methodology, databases, keywords, inclusion/exclusion criteria, or time spans covered. This is load-bearing for the central claims, which rest entirely on the representativeness of the selected studies (as the skeptic note correctly identifies).

    Authors: We agree the current abstract and text omit these details, which is a substantive weakness. In revision we will add a dedicated 'Methods' subsection describing the search strategy (databases: Google Scholar, arXiv, ACM; keywords: geographic bias, spatial bias in AI, generative defaults; time span: 2015–2024; inclusion: English-language peer-reviewed and preprint works explicitly addressing geographic dimensions of bias; exclusion: purely technical papers without bias analysis). The abstract will be updated to reference this section. revision: yes

  2. Referee: [Abstract] The claim that 'generative AI tends to over-proportionally favor prototypical places' and that 'recent studies evaluate geographic diversity across cognitive levels, parameter settings, and output modalities' is presented as a synthesis result, yet no quantitative information (number of studies per category, explicit selection criteria) is supplied to allow verification or to rule out systematic omission of counter-examples or non-English work.

    Authors: The manuscript is a narrative rather than systematic review, so quantitative tallies were not originally provided. We will revise by inserting a summary table or paragraph stating the number of studies per bias category and per evaluation dimension, restating the explicit selection criteria, and adding an explicit limitation note on the exclusion of non-English literature. This will make the synthesis claims verifiable without altering their substance. revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive literature survey with no derivations or self-referential reductions

The paper is explicitly a literature review synthesizing prior work on geographic bias in AI across pre-generative and generative periods. It contains no equations, fitted parameters, predictions, or derivation chains. All claims about biases and evaluation studies are presented as summaries of external literature rather than internally derived results. The representativeness of the cited body is an external assumption about coverage, not a reduction of the paper's own logic to its inputs by construction. No self-citation load-bearing steps or other enumerated patterns are present.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Literature review paper; no new mathematical models, parameters, or entities introduced.

pith-pipeline@v0.9.1-grok · 5736 in / 885 out tokens · 25187 ms · 2026-07-01T08:42:34.412618+00:00 · methodology

Reference graph

Works this paper leans on

42 extracted references · 5 canonical work pages · 4 internal anchors

  1. [1] David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985
  2. [2] Equal Credit Opportunity Act. Equal credit opportunity act. Women in the American Political System: An Encyclopedia of Women as Voters, Candidates, and Office Holders, 2:129, 2018
  3. [3] Fair Housing Act. Fair housing act. Home Mortgage Disclosure Act, and Community, 1968
  4. [4] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: Risk assessments in criminal sentencing. ProPublica, May 23, 2016. URL: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
  5. [5] L Anselin. What is special about spatial data?: alternative perspectives on spatial data analysis. Technical paper/National Center for Geographic Information and Analysis (89-4), 1989
  6. [6] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems, 29, 2016
  7. [7] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021
  8. [8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020
  9. [9] Jiahao Chen, Nathan Kallus, Xiaojie Mao, Geoffry Svacha, and Madeleine Udell. Fairness under unawareness: Assessing disparity when protected class is unobserved. In Proceedings of the conference on fairness, accountability, and transparency, pages 339–348, 2019
  10. [10] Michael F Goodchild. The openshaw effect. International Journal of Geographical Information Science, 36(9):1697–1698, 2022
  11. [11] Michael F Goodchild and Wenwen Li. Replication across space and time must be weak in the social and environmental sciences. Proceedings of the National Academy of Sciences, 118(35):e2015759118, 2021
  12. [12] Mark O Hill. Diversity and evenness: a unifying notation and its consequences. Ecology, 54(2):427–432, 1973
  13. [13] Krzysztof Janowicz, Zilong Liu, Gengchen Mai, Zhangyu Wang, Ivan Majic, Alexandra Fortacz, Grant McKenzie, and Song Gao. Whose truth? pluralistic geo-alignment for (agentic) ai. In Proceedings of the 33rd ACM International Conference on Advances in Geographic Information Systems, pages 799–803, 2025
  14. [14] Lou Jost. Entropy and diversity. Oikos, 113(2):363–375, 2006
  15. [15] Yiting Ju, Benjamin Adams, Krzysztof Janowicz, Yingjie Hu, Bo Yan, and Grant McKenzie. Things and strings: improving place name disambiguation from short texts by combining entity co-occurrence with topic modeling. In European Knowledge Acquisition Workshop, pages 353–367. Springer, 2016
  16. [16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012
  17. [17] Jacob Kruse, Song Gao, Yuhan Ji, Daniel P Szabo, and Kenneth R Mayer. Bringing spatial interaction measures into multi-criteria assessment of redistricting plans using interactive web mapping. Cartography and Geographic Information Science, 51(4):513–532, 2024
  18. [18] Tom Leinster and Christina A Cobbold. Measuring diversity: the importance of species similarity. Ecology, 93(3):477–489, 2012
  19. [19] Zilong Liu, Krzysztof Janowicz, Ling Cai, Rui Zhu, Gengchen Mai, and Meilin Shi. Geoparsing: Solved or biased? an evaluation of geographic biases in geoparsing. AGILE: GIScience Series, 3:9, 2022
  20. [20] Zilong Liu, Krzysztof Janowicz, and Mina Karimi. Assessing the geographic diversity of ai’s platial representations in image generation. In AGILE: GIScience Series, 2026. Accepted for publication
  21. [21] Zilong Liu, Krzysztof Janowicz, Mina Karimi, Meilin Shi, Ivan Majic, and Alexandra Fortacz. Golden gate bridge, as always? eliciting prototypical places from autoregressive large language models via category production. Transactions in GIS. Accepted for publication
  22. [22] Zilong Liu, Krzysztof Janowicz, Ivan Majic, Meilin Shi, Alexandra Fortacz, Mina Karimi, Gengchen Mai, and Kitty Currier. Operationalizing geographic diversity for the evaluation of ai-generated content. Transactions in GIS, 29(3):e70057, 2025
  23. [23] Gengchen Mai, Weiming Huang, Jin Sun, Suhang Song, Deepak Mishra, Ninghao Liu, Song Gao, Tianming Liu, Gao Cong, Yingjie Hu, et al. On the opportunities and challenges of foundation models for geoai (vision paper). ACM Transactions on Spatial Algorithms and Systems, 10(2):1–46, 2024
  24. [24] Rohin Manvi, Samar Khanna, Marshall Burke, David Lobell, and Stefano Ermon. Large language models are geographically biased. In Proceedings of the 41st International Conference on Machine Learning, pages 34654–34669, 2024
  25. [25] Rohin Manvi, Samar Khanna, Gengchen Mai, Marshall Burke, David B Lobell, and Stefano Ermon. Geollm: Extracting geospatial knowledge from large language models. In The Twelfth International Conference on Learning Representations, 2024
  26. [26] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. ACM computing surveys (CSUR), 54(6):1–35, 2021
  27. [27] Tomas Mikolov. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 3781, 2013
  28. [28] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26, 2013
  29. [29] Mazda Moayeri, Elham Tabassi, and Soheil Feizi. Worldbench: Quantifying geographic disparities in llm factual recall. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 1211–1228, 2024
  30. [30] Patrick AP Moran. Notes on continuous stochastic phenomena. Biometrika, 37(1/2):17–23, 1950
  31. [31] Ranjita Naik and Besmira Nushi. Social biases through the text-to-image generation lens. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 786–808, 2023
  32. [32] Stan Openshaw. The modifiable areal unit problem. Concepts and techniques in modern geography, 1984
  33. [33] Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, 2019
  34. [34] Rida Qadri, Renee Shelby, Cynthia L Bennett, and Remi Denton. Ai’s regimes of representation: A community-centered study of text-to-image models in south asia. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pages 506–517, 2023
  35. [35] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015
  36. [36] Shreya Shankar, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D Sculley. No classification without representation: Assessing geodiversity issues in open data sets for the developing world. arXiv preprint arXiv:1711.08536, 2017
  37. [37] Claude Elwood Shannon. A mathematical theory of communication. ACM SIGMOBILE mobile computing and communications review, 5(1):3–55, 2001
  38. [38] EH Simpson. Measurement of diversity. Nature, 163, 1949
  39. [39] Taylor Sorensen, Jared Moore, Jillian Fisher, Mitchell Gordon, Niloofar Mireshghallah, Christopher Michael Rytting, Andre Ye, Liwei Jiang, Ximing Lu, Nouha Dziri, et al. A roadmap to pluralistic alignment. arXiv preprint arXiv:2402.05070, 2024
  40. [40] Harini Suresh and John Guttag. A framework for understanding sources of harm throughout the machine learning life cycle. In Proceedings of the 1st ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, pages 1–9, 2021
  41. [41] Jimin Wang, Yingjie Hu, and Kenneth Joseph. Neurotpr: A neuro-net toponym recognition model for extracting locations from social media messages. Transactions in GIS, 24(3):719–735, 2020. doi: 10.1111/tgis.12627
  42. [42] Nemin Wu, Qian Cao, Zhangyu Wang, Zeping Liu, Yanlin Qi, Jielu Zhang, Joshua Ni, Xiaobai Yao, Hongxu Ma, Lan Mu, Stefano Ermon, Tanuja Ganu, Akshay Nambi, Ni Lao, and Gengchen Mai. Torchspatial: A location encoding framework and benchmark for spatial representation learning. Advances in Neural Information Processing Systems, 37:81437–81460, 2024