pith. machine review for the scientific record.

arxiv: 2604.14672 · v1 · submitted 2026-04-16 · 💻 cs.CL

Recognition: unknown

SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords spatial gender bias · large language models · urban micro-spaces · bias evaluation · gender stereotypes · LLM pipeline tracing

The pith

Large language models attach structured gender biases to specific urban micro-spaces, and these associations exceed real-world patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops SPAGBias, a framework that measures how LLMs link genders to 62 urban micro-spaces through forced-choice prompts, token probabilities, and narrative analysis. It shows these links form detailed patterns beyond broad public-private splits and are stronger than the patterns in actual city usage data. The work traces the patterns to pre-training, tuning, and reward stages while testing effects on story generation and downstream tasks. A reader should care because LLMs now inform urban planning decisions, where biased space associations could shape real designs and access. The central result is that these spatial gender mappings are measurable, traceable, and consequential.
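
To make the forced-choice layer concrete, here is a minimal sketch assuming a generic chat-completion client; query_model, the four-space subset, and the tally format are hypothetical stand-ins, not the paper's released code.

from collections import Counter

MICRO_SPACES = ["kitchen", "barbershop", "nail salon", "construction site"]  # illustrative subset of the 62
PROMPT = "Who is more likely to be found in a {space}? Answer with one word: men, women, or both."

def query_model(prompt: str) -> str:
    """Placeholder for an LLM API call that returns a one-word answer."""
    raise NotImplementedError

def forced_choice_tally(space: str, n_samples: int = 50) -> Counter:
    # Resample the same forced-choice prompt and count each answer;
    # a skew toward "men" or "women" is the explicit-layer bias signal.
    answers = Counter()
    for _ in range(n_samples):
        reply = query_model(PROMPT.format(space=space)).strip().lower()
        answers[reply if reply in {"men", "women", "both"} else "other"] += 1
    return answers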

Core claim

SPAGBias identifies structured gender-space associations in LLMs that go beyond the public-private divide and form nuanced micro-level mappings, with model outputs substantially exceeding real-world distributions; these patterns are embedded and reinforced across the model pipeline and produce concrete failures in normative and descriptive application settings.

What carries the argument

SPAGBias framework, which combines a taxonomy of 62 urban micro-spaces, a prompt library, and three diagnostic layers (explicit forced-choice, probabilistic token asymmetry, and constructional semantic-narrative analysis) to detect and trace spatial gender bias.
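
One plausible way to organize the framework's pieces in code, sketched as Python dataclasses; the field names mirror the description above, but the structure itself is an assumption, not the authors' implementation.

from dataclasses import dataclass, field

@dataclass
class SpaceEntry:
    name: str                                         # one of the 62 urban micro-spaces
    category: str                                     # e.g. "public" or "private"
    prompts: list[str] = field(default_factory=list)  # prompt-library variants for this space

@dataclass
class Diagnostics:
    explicit: dict[str, float]       # forced-choice answer frequencies per gender option
    probabilistic: dict[str, float]  # token-level log-probability asymmetries
    constructional: dict[str, str]   # coded semantic/narrative roles from generated stories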

If this is right

  • Bias expression changes with prompt design, temperature settings, and model scale (see the sensitivity sketch after this list).
  • Story generation shows emotion, wording, and social roles jointly shaping spatial gender narratives.
  • Patterns are reinforced across pre-training, instruction tuning, and reward modeling stages.
  • Biases lead to concrete failures when models are applied in normative or descriptive urban planning contexts.
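
The sensitivity sweep flagged in the first bullet could look like the following sketch; gender_score (the share of resampled answers naming men for a space) and the grid of settings are assumptions, not the paper's exact protocol.

import itertools

TEMPERATURES = [0.0, 0.7, 1.0]
PROMPT_VARIANTS = ["binary", "three_option"]

def gender_score(space: str, prompt_type: str, temperature: float) -> float:
    """Placeholder: fraction of resampled answers naming men for `space`."""
    raise NotImplementedError

def sensitivity_grid(space: str) -> dict[tuple[str, float], float]:
    # A bias estimate is only trustworthy if it stays stable across this grid.
    return {
        (p, t): gender_score(space, p, t)
        for p, t in itertools.product(PROMPT_VARIANTS, TEMPERATURES)
    }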

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Mitigation would need to address multiple pipeline stages rather than a single fine-tuning step.
  • The same measurement approach could be applied to other spatial domains such as virtual environments or transportation networks.
  • Real-world urban datasets could serve as calibration targets for reducing excess associations in future models.

Load-bearing premise

The taxonomy of 62 urban micro-spaces together with the prompt library and three diagnostic layers accurately captures embedded spatial gender bias without introducing artifacts from the specific prompt designs or researcher-chosen categories.

What would settle it

Statistical counts of actual gender occupancy or usage for the same 62 micro-spaces drawn from city census or observational data; if model-generated associations closely match those counts rather than exceeding them, the claim of structured excess bias would be falsified.
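
A minimal version of that falsification check, assuming per-space shares of "men" associations from model resampling (model_share) and from census or observational counts (real_share); the numbers in the example call are fabricated purely to show the shape of the comparison.

def excess_bias(model_share: dict[str, float],
                real_share: dict[str, float]) -> dict[str, float]:
    # Positive values mean the model association exceeds the real-world rate;
    # values near zero across the 62 spaces would falsify the excess-bias claim.
    return {space: model_share[space] - real_share[space]
            for space in model_share if space in real_share}

# Hypothetical shares, not measured data:
print(excess_bias({"kitchen": 0.15, "construction site": 0.95},
                  {"kitchen": 0.35, "construction site": 0.89}))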

Figures

Figures reproduced from arXiv: 2604.14672 by Binxian Su, Dong Yu, Haoye Lou, Pengyuan Liu, Shucheng Zhu, Weikang Wang, Ying Liu.

Figure 1. Example generations from GPT-4 and Qwen2.
Figure 2. A Framework for Measuring Spatial Gender Bias in LLMs. We construct a structured resource comprising…
Figure 3. Cross-model variance and culturally salient…
Figure 4. Private space map for GPT-3.5-turbo. The…
Figure 5. Odds ratio of adjectives associated with spatial…
Figure 7. Comparison of log-probabilities in the private…
Figure 8. Spatial Taxonomy comprising 62 categories…
Figure 9. Average refusal rates across models under…
Figure 10. Consistency of gender bias measurements between binary (Prompt 1) and three-option (Prompt 0) prompts. Each point represents the normalized men’s frequency for a specific model-space pair under both prompt types. Points closer to the diagonal indicate higher consistency in bias measurement. Pairs with complete refusals to answer under Prompt 0 are excluded. Partial point transparency is used to indicate o…
Figure 11. Role annotation guidelines defining four behavioral roles (Leader, Supporter, Observer, Dependent) with…
Figure 12. Proportional distribution of men’s and women’s roles across four narrative categories. In private spaces,…
Figure 13. Pairwise Pearson correlation coefficients…
Figure 14. Top 10 co-occurrence rates of spatial and…
Figure 15. Top 10 co-occurrence rates of spatial and…
Figure 16. Gender bias maps of public spaces for six language models. Bluer regions indicate men-associated…
Figure 17. Gender bias maps of private spaces for six language models. Bluer regions indicate men-associated…
Figure 18. Detailed results of all models across the 62 urban spaces. Darker colors indicate higher EDI values,…
Figure 19. EDI scores of six LLMs under five aggregated prompt types, with each prompt sampled 10 times.
Figure 20. Comparison of log-probabilities in the public space between the…
read the original abstract

Large language models (LLMs) are being increasingly used in urban planning, but since gendered space theory highlights how gender hierarchies are embedded in spatial organization, there is concern that LLMs may reproduce or amplify such biases. We introduce SPAGBias - the first systematic framework to evaluate spatial gender bias in LLMs. It combines a taxonomy of 62 urban micro-spaces, a prompt library, and three diagnostic layers: explicit (forced-choice resampling), probabilistic (token-level asymmetry), and constructional (semantic and narrative role analysis). Testing six representative models, we identify structured gender-space associations that go beyond the public-private divide, forming nuanced micro-level mappings. Story generation reveals how emotion, wording, and social roles jointly shape "spatial gender narratives". We also examine how prompt design, temperature, and model scale influence bias expression. Tracing experiments indicate that these patterns are embedded and reinforced across the model pipeline (pre-training, instruction tuning, and reward modeling), with model associations found to substantially exceed real-world distributions. Downstream experiments further reveal that such biases produce concrete failures in both normative and descriptive application settings. This work connects sociological theory with computational analysis, extending bias research into the spatial domain and uncovering how LLMs encode social gender cognition through language.
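
One way to realize the probabilistic layer the abstract names (token-level asymmetry), sketched with Hugging Face transformers on an open stand-in model; the context template and the man/woman token pair are assumptions, not the paper's exact setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in open model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    # Sum of log-probabilities the model assigns to `continuation` after `context`.
    # Assumes the context tokenizes identically inside the concatenated string,
    # which holds for space-prefixed continuations under the GPT-2 tokenizer.
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full_ids).logits, dim=-1)
    return sum(logprobs[0, pos - 1, full_ids[0, pos]].item()
               for pos in range(ctx_len, full_ids.shape[1]))

space = "kitchen"
ctx = f"The person standing in the {space} is a"
asymmetry = continuation_logprob(ctx, " man") - continuation_logprob(ctx, " woman")
print(f"{space}: log-prob asymmetry (man - woman) = {asymmetry:.3f}")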

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SPAGBias, a systematic framework combining a taxonomy of 62 urban micro-spaces, a prompt library, and three diagnostic layers (explicit forced-choice resampling, probabilistic token-level asymmetry, and constructional semantic/narrative analysis) to evaluate spatial gender bias in LLMs. Testing six models, it identifies structured gender-space associations beyond the public-private divide that substantially exceed real-world distributions, traces these patterns through pre-training, instruction tuning, and reward modeling stages, examines influences of prompt design and model scale, and shows downstream failures in normative and descriptive settings.

Significance. If the central claims hold, the work is significant for bridging sociological gender-space theory with computational analysis of LLMs, extending bias research to the spatial domain. Credit is due for the multi-layered diagnostic framework, the tracing experiments across the model pipeline, and the downstream application tests, which provide concrete evidence of impact if the measurements are robust.

major comments (2)
  1. [§4.3] The claim that 'model associations found to substantially exceed real-world distributions' is load-bearing for the tracing and exceedance results, but the manuscript does not provide an independent empirical baseline (e.g., from census counts or time-use surveys) for the 62 micro-spaces; if derived from the same sources as the taxonomy, this introduces potential circularity in the comparison.
  2. [§3.2] Details on statistical controls for prompt variations, temperature effects, and how the three diagnostic layers handle researcher-defined categories are insufficient in the methods, which is necessary to rule out artifacts in the measured biases.
minor comments (2)
  1. The specific six models tested should be named in the abstract or early introduction for clarity.
  2. [Figure 3] The story generation examples could include more quantitative metrics alongside the qualitative analysis to strengthen the narrative role claims.
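
On the second minor point, one quantitative metric that pairs naturally with the qualitative story analysis is the adjective odds ratio already plotted in Figure 5; a hedged sketch with a Haldane-Anscombe zero-cell correction and hypothetical counts follows.

import math

def adjective_odds_ratio(men_with: int, men_total: int,
                         women_with: int, women_total: int) -> float:
    # Odds of the adjective appearing in men-protagonist stories over the
    # odds in women-protagonist stories; +0.5 guards against zero cells.
    a = men_with + 0.5
    b = men_total - men_with + 0.5
    c = women_with + 0.5
    d = women_total - women_with + 0.5
    return (a / b) / (c / d)

# Hypothetical counts: "gentle" in 12/200 men's vs 48/200 women's stories.
print(math.log(adjective_odds_ratio(12, 200, 48, 200)))  # negative: women-skewed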

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our manuscript. Their comments highlight important areas for clarification and strengthening, particularly around empirical baselines and methodological transparency. We address each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [§4.3] The claim that 'model associations found to substantially exceed real-world distributions' is load-bearing for the tracing and exceedance results, but the manuscript does not provide an independent empirical baseline (e.g., from census counts or time-use surveys) for the 62 micro-spaces; if derived from the same sources as the taxonomy, this introduces potential circularity in the comparison.

    Authors: We appreciate the referee identifying this as a load-bearing claim and the risk of circularity. The taxonomy of 62 micro-spaces draws from urban sociology and planning literature on gendered spaces, while the real-world distribution estimates were separately derived from national time-use surveys (e.g., American Time Use Survey) and occupational census data, mapped onto the micro-space categories. However, the current manuscript does not explicitly document the mapping procedure, raw percentages, or independence of these sources, which could reasonably raise the concern noted. In the revised version, we will add a new subsection in §4.3 (and an accompanying appendix table) that lists the precise survey sources, the percentage estimates for each of the 62 spaces, the mapping methodology, and a direct comparison of model vs. real-world distributions. This will make the exceedance results fully transparent and address the circularity issue. revision: yes

  2. Referee: [§3.2] Details on statistical controls for prompt variations, temperature effects, and how the three diagnostic layers handle researcher-defined categories are insufficient in the methods, which is necessary to rule out artifacts in the measured biases.

    Authors: We agree that additional methodological detail is required to demonstrate that the measured biases are not artifacts of prompt phrasing, sampling parameters, or researcher-defined categories. In the revised manuscript we will expand §3.2 with: (i) a description of the prompt library construction, including synonym-variation testing and reporting that bias patterns remained stable across phrasings; (ii) explicit temperature sensitivity results (0.0, 0.7, and 1.0) showing that core gender-space associations persist; and (iii) protocols for each diagnostic layer—forced-choice resampling with multiple random seeds, token-level log-probability analysis that operates independently of category labels, and blinded narrative coding with inter-annotator agreement statistics (Cohen’s κ > 0.8). These additions will be included to strengthen the robustness claims. revision: yes
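
The agreement statistic cited above is standard; for reference, a self-contained Cohen's kappa computation over the four behavioral roles (Leader, Supporter, Observer, Dependent), with toy labels rather than the paper's annotations.

from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    # Observed agreement corrected for the agreement expected by chance.
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["Leader", "Supporter", "Supporter", "Observer", "Dependent", "Leader"]
b = ["Leader", "Supporter", "Observer", "Observer", "Dependent", "Leader"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # ~0.78 on these toy labels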

Circularity Check

0 steps flagged

No significant circularity; claims rest on external empirical comparisons

full rationale

The paper defines a taxonomy, prompt library, and three diagnostic layers to measure LLM outputs, then compares the resulting gender-space associations against external real-world distributions. No equations, fitted parameters, or self-citations reduce the exceedance or tracing results to the inputs by construction. The derivation chain is checked against independent benchmarks rather than being self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the validity of the new taxonomy and diagnostics plus the assumption that observed model patterns reflect pre-existing societal biases rather than evaluation artifacts.

axioms (1)
  • domain assumption: Gendered space theory highlights how gender hierarchies are embedded in spatial organization
    Invoked to justify concern that LLMs may reproduce or amplify such biases.

pith-pipeline@v0.9.0 · 5539 in / 1153 out tokens · 25350 ms · 2026-05-10T11:49:53.038181+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

67 extracted references · 25 canonical work pages · 2 internal anchors

  3. [3]

    Marah I Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Hassan Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, and 66 others. 2024. https://www.microsoft.com/en-us...

  4. [4]

    Alibaba Group. 2024. https://huggingface.co/Qwen/Qwen2-7B-Instruct Qwen2: A high-performance language model family. Accessed: 2025-04-21

  5. [5]

    Michael GW Bamberg. 1997. Positioning between structure and performance

  6. [6]

    Yasminah Beebeejaun. 2017. Gender, urban space, and the right to everyday life. Journal of Urban Affairs, 39(3):323--334

  7. [7]

    Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. https://doi.org/10.1145/3442188.3445922 On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, page 610–623, New York, NY, USA. Association for Computing...

  8. [8]

    Kirti Bhagat, Kinshuk Vasisht, and Danish Pruthi. 2025. https://arxiv.org/abs/2411.07320 Richer output for richer countries: Uncovering geographical disparities in generated stories and travel recommendations. Preprint, arXiv:2411.07320

  9. [9]

    Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in neural information processing systems, 29

  10. [10]

    Pierre Bourdieu. 2001. Masculine domination. Stanford University Press

  11. [11]

    Peter I Buerhaus. 2010. American nursing: A history of knowledge, authority, and the meaning of work. JAMA, 304(20):2301--2302

  12. [12]

    Matthew Carmona. 2021. Public places urban spaces: The dimensions of urban design. Routledge

  13. [13]

    Aili Chen, Xuyang Ge, Ziquan Fu, Yanghua Xiao, and Jiangjie Chen. 2024. https://arxiv.org/abs/2409.08069 Travelagent: An ai assistant for personalized travel planning. Preprint, arXiv:2409.08069

  14. [14]

    Robert William Connell. 2020. Masculinities. Routledge

  15. [15]

    DeepSeek. 2024. https://deepseekcoder.github.io/ Deepseek llm: Scaling open-source language models with dense and mixture-of-experts models. Accessed: 2025-04-21

  16. [16]

    Dorothy Dinnerstein. 1999. The mermaid and the minotaur: Sexual arrangements and human malaise. Other Press, LLC

  17. [17]

    Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.98 Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Process...

  18. [18]

    Yanai Elazar, Akshita Bhagia, Ian Magnusson, Abhilasha Ravichander, Dustin Schwenk, Alane Suhr, Pete Walsh, Dirk Groeneveld, Luca Soldaini, Sameer Singh, Hanna Hajishirzi, Noah A. Smith, and Jesse Dodge. 2024. https://arxiv.org/abs/2310.20707 What's in my big data? Preprint, arXiv:2310.20707

  19. [19]

    Jean Bethke Elshtain. 2020. Public man, private woman: Women in social and political thought. Princeton University Press

  20. [20]

    Federico Errica, Davide Sanvito, Giuseppe Siracusano, and Roberto Bifulco. 2025. https://doi.org/10.18653/v1/2025.naacl-long.73 What did I do wrong? Quantifying LLMs' sensitivity and consistency to prompt engineering. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Lang...

  21. [21]

    Betty Friedan. 2013. The feminine mystique. WW Norton & Company

  22. [22]

    Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. https://doi.org/10.1073/pnas.1720347115 Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16)

  23. [23]

    Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. Chatgpt outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30):e2305016120

  24. [24]

    Erving Goffman. 1981. Forms of talk. University of Pennsylvania Press

  25. [25]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. https://arxiv.org/abs/2407.21783 The llama 3...

  26. [26]

    Groq. 2024. https://groq.com/ Groq api and lpu inference engine. Accessed: 2025-04-18

  27. [27]

    Michael Alexander Kirkwood Halliday and Christian M.I.M. Matthiessen. 2014. Halliday's introduction to functional grammar

  28. [28]

    Rom Harré and others. 2012. Positioning theory: Moral dimensions of social-cultural psychology. The Oxford handbook of culture and psychology, 1:191--206

  29. [29]

    Alibaba Cloud Intelligence. 2024. https://www.alibabacloud.com/ Alibaba cloud ai computing platform - qwen2 deployment. Accessed: 2025-04-18

  30. [30]

    Susan Jamieson. 2004. Likert scales: How to (ab) use them? Medical education, 38(12):1217--1218

  31. [31]

    Lucinda J Kaukas. 2002. Gender space architecture: An interdisciplinary introduction

  32. [32]

    Svetlana Kiritchenko and Saif M. Mohammad. 2018. https://arxiv.org/abs/1805.04508 Examining gender and race bias in two hundred sentiment analysis systems . Preprint, arXiv:1805.04508

  33. [33]

    Henri Lefebvre. 1991. The production of space. Basil Blackwell

  34. [34]

    Zhonghang Li, Lianghao Xia, Xubin Ren, Jiabin Tang, Tianyi Chen, Yong Xu, and Chao Huang. 2025. https://arxiv.org/abs/2504.02009 Urban computing in the era of large language models . Preprint, arXiv:2504.02009

  35. [35]

    Kevin Lynch. 1964. The image of the city. MIT press

  36. [36]

    Rohin Manvi, Samar Khanna, Marshall Burke, David Lobell, and Stefano Ermon. 2024a. Large language models are geographically biased. arXiv preprint arXiv:2402.02680

  37. [37]

    Rohin Manvi, Samar Khanna, Gengchen Mai, Marshall Burke, David B Lobell, and Stefano Ermon. 2024b. Geollm: Extracting geospatial knowledge from large language models. In ICLR

  38. [38]

    Doreen Massey. 2013. Space, place and gender. John Wiley & Sons

  39. [39]

    Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. 2019. https://doi.org/10.18653/v1/N19-1063 On measuring social biases in sentence encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pa...

  40. [40]

    Mintel. 2012. https://www.mintel.com/press-centre/women-more-likely-to-visit-a-salon-but-a-growing-number-of-men-interested-in-these-services Women more likely to visit a salon, but a growing number of men interested in these services. Accessed: 2026-04-13

  41. [41]

    Shujaat Mirza, Bruno Coelho, Yuyuan Cui, Christina Pöpper, and Damon McCoy. 2024. Global-liar: Factuality of llms over time and geographic regions. arXiv preprint arXiv:2401.17839

  42. [42]

    Sharlene Mollett and Caroline Faria. 2018. The spatialities of intersectional thinking: fashioning feminist geographic futures. Gender, Place & Culture, 25(4):565--577

  43. [43]

    Morningstar. 2023. https://www.morningstar.com/personal-finance/100-must-know-statistics-about-long-term-care-2023-edition 100 must-know statistics about long-term care: 2023 edition. Accessed: 2026-04-13

  44. [44]

    National Center for Health Statistics. 2024. https://www.ahcancal.org/News-and-Communications/Blog/Pages/NCHS-Publishes-New-Data-Brief-on-Residential-Care-Community-Resident-Characteristics.aspx NCHS publishes new data brief on residential care community resident characteristics. Accessed: 2026-04-13

  45. [45]

    Daisuke Oba, Masahiro Kaneko, and Danushka Bollegala. 2024. In-contextual gender bias suppression for large language models. EACL (Findings), pages 1722--1742

  46. [46]

    OpenAI. 2022. https://platform.openai.com/docs/models/gpt-3-5 Gpt-3.5-turbo. Accessed: 2025-04-18

  47. [47]

    OpenAI. 2023. https://arxiv.org/abs/2303.08774 Gpt-4 technical report. Preprint, arXiv:2303.08774. This report covers the development and evaluation of GPT-4

  48. [48]

    OpenAI. 2024. https://platform.openai.com/docs Openai api documentation. Accessed: 2025-04-18

  49. [49]

    Rachel Pain. 2001. https://doi.org/10.1080/00420980120046590 Gender, race, age and fear in the city. Urban Studies, 38(5-6):899--913

  50. [50]

    C.C. Perez. 2019. https://books.google.com/books?id=GdmEDwAAQBAJ Invisible Women: Data Bias in a World Designed for Men. Abrams Press

  51. [51]

    Lakshmi Puri. 2016. Speech by Deputy Executive Director Lakshmi Puri at the Women’s and Youth Assembly at Habitat III. https://www.unwomen.org/en/news-stories. Speech at the Women’s and Youth Assembly, Habitat III, UN Women

  52. [52]

    Matthew Renze. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.432 The effect of sampling temperature on problem solving in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7346--7356, Miami, Florida, USA. Association for Computational Linguistics

  53. [53]

    Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. https://arxiv.org/abs/2310.11324 Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. Preprint, arXiv:2310.11324

  54. [54]

    Joanne Sharp. 2009. Geography and gender: what belongs to feminist geography? emotion, power and change. Progress in human geography, 33(1):74--80

  55. [55]

    Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. https://doi.org/10.18653/v1/D19-1339 The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),...

  56. [56]

    Rachel Silvey. 2006. Geographies of gender and migration: Spatializing social difference 1. International Migration Review, 40(1):64--81

  57. [57]

    Bernd Carsten Stahl and Damian Eke. 2024. https://doi.org/10.1016/j.ijinfomgt.2023.102700 The ethics of chatgpt – exploring the ethical issues of an emerging technology . International Journal of Information Management, 74:102700

  58. [58]

    Gabriel Stanovsky, Noah A. Smith, and Luke Zettlemoyer. 2019. https://doi.org/10.18653/v1/P19-1164 Evaluating gender bias in machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1679--1684, Florence, Italy. Association for Computational Linguistics

  59. [59]

    Petter Törnberg. 2023. Chatgpt-4 outperforms experts and crowd workers in annotating political twitter messages with zero-shot learning. arXiv preprint arXiv:2304.06588

  60. [60]

    UN Women. 2017. https://www.unwomen.org/en/news/in-focus/csw61/redistribute-unpaid-work Redistribute unpaid work. Accessed: 2025-04-18

  61. [61]

    United Nations Industrial Development Organization. 2020. https://news.un.org/zh/story/2020/12/1073462 Women in industry: Why we need to improve statistics on gender. Accessed: 2026-04-13

  62. [62]

    Eva Vanmassenhove, Christian Hardmeier, and Andy Way. 2018. https://doi.org/10.18653/v1/D18-1334 Getting gender right in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3003--3008, Brussels, Belgium. Association for Computational Linguistics

  63. [63]

    Yixin Wan, George Pu, Jiao Sun, Aparna Garimella, Kai-Wei Chang, and Nanyun Peng. 2023. "Kelly is a warm person, Joseph is a role model": Gender biases in llm-generated reference letters. arXiv preprint arXiv:2310.09219

  64. [64]

    Angelina Wang, Michelle Phan, Daniel E. Ho, and Sanmi Koyejo. 2025. https://doi.org/10.18653/v1/2025.acl-long.341 Fairness through difference awareness: Measuring desired group discrimination in LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6867--6893, Vienna, Austria. Ass...

  65. [65]

    World Bank. 2025. https://data.worldbank.org.cn/indicator/SL.IND.EMPL.FE.ZS World development indicators: Employment in industry and labor force participation by gender. Indicators used: SL.IND.EMPL.FE.ZS, SL.EMP.TOTL.SP.FE.ZS, SL.IND.EMPL.MA.ZS, SL.EMP.TOTL.SP.MA.ZS. Accessed: 2026-04-13

  66. [66]

    Yifan Zhang, Cheng Wei, Zhengting He, and Wenhao Yu. 2024. Geogpt: An assistant for understanding and processing geospatial tasks. International Journal of Applied Earth Observation and Geoinformation, 131:103976

  67. [67]

    Jinman Zhao, Yitian Ding, Chen Jia, Yining Wang, and Zifan Qian. 2024. Gender bias in large language models across multiple languages. arXiv preprint arXiv:2403.00277