Neutrality Bites: Gender Representation in AI-Generated Animal Stories

Imani Finkley; Melanie Walsh; Yuanxi Li

arxiv: 2606.07969 · v1 · pith:ZHVBENCInew · submitted 2026-06-06 · 💻 cs.CL · cs.AI


Pith reviewed 2026-06-27 20:11 UTC · model grok-4.3


keywords: gender bias, large language models, animal stories, neutrality, masculine bias, AI-generated narratives, identity erasure


The pith

Large language models assign masculine gender to animal characters far more often than feminine when they assign gender at all.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests six leading LLMs by prompting them to complete stories about seven anthropomorphic animals whose gender is left unstated. Models often avoid gendering the character at all (19 percent of stories on average) or use gender-neutral language such as "it" (38.2 percent on average). Where a gender is assigned, it is overwhelmingly masculine: masculine characters appear in 40.6 percent of stories and feminine characters in only 2.2 percent. The pattern holds across four narrative settings and a range of temperatures. The authors conclude that efforts to achieve neutrality can produce the erasure of feminine and other marginalized identities rather than balanced representation. They argue that new approaches are needed to distribute social possibilities more evenly across imagined subjects.

Core claim

Across 23.8K stories, models avoid gendering the animal character in 19 percent of cases on average and use gender-neutral language in 38.2 percent. Where a gender is assigned, it is overwhelmingly masculine: masculine characters appear in 40.6 percent of stories and feminine characters in 2.2 percent. The authors therefore claim that neutrality in LLMs contributes to the erasure of marginalized perspectives and identities.
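The four reported shares sum to 100 percent (19 + 38.2 + 40.6 + 2.2), which suggests that 40.6 and 2.2 are percentages of all stories rather than of gendered stories only. Under that reading, which is an inference from the abstract and not a figure the paper itself reports, the split among stories that do assign a gender works out as follows:

```latex
\[
\frac{40.6}{40.6 + 2.2} = \frac{40.6}{42.8} \approx 94.9\%
\quad\text{(masculine)},
\qquad
\frac{2.2}{42.8} \approx 5.1\%
\quad\text{(feminine)}
\]
```

That is roughly 18 masculine characters for every feminine one, assuming the four categories are mutually exclusive.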

What carries the argument

Prompting LLMs to complete English-language stories about anthropomorphic animals with unstated gender, then measuring the resulting patterns of gender assignment versus neutrality.
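The design described above is a full crossing of models, animals, settings, and temperatures. A minimal sketch of how such a condition grid could be enumerated is shown here; every label and every temperature value is a placeholder, since the abstract gives only the counts (six, seven, four) and says "a range" of temperatures.

```python
from itertools import product

# Hypothetical placeholder labels: the abstract names none of these.
MODELS = [f"model_{i}" for i in range(1, 7)]        # six LLMs
ANIMALS = [f"animal_{i}" for i in range(1, 8)]      # seven animals
SETTINGS = [f"setting_{i}" for i in range(1, 5)]    # four narrative settings
TEMPERATURES = [0.0, 0.5, 1.0]                      # assumed values, not from the paper


def build_conditions(models, animals, settings, temperatures):
    """Return one dict per (model, animal, setting, temperature) cell."""
    return [
        {"model": m, "animal": a, "setting": s, "temperature": t}
        for m, a, s, t in product(models, animals, settings, temperatures)
    ]


conditions = build_conditions(MODELS, ANIMALS, SETTINGS, TEMPERATURES)
print(len(conditions))  # 6 * 7 * 4 * 3 cells under these assumptions
```

Each cell would then be sampled repeatedly to reach the reported story count; the number of samples per cell is not stated in the abstract.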

Load-bearing premise

The six chosen LLMs, seven animals, four settings, and temperature values are representative enough to support general claims about how LLMs assign gender in ambiguous narrative contexts.

What would settle it

A larger study using additional models or prompts that finds feminine animal characters appearing at rates equal to or higher than masculine ones when gender is assigned.

Figures

Figures reproduced from arXiv: 2606.07969 by Imani Finkley, Melanie Walsh, Yuanxi Li.

**Figure 2.** Gender distribution results from 1327 human survey respondents given a story-completion task with varying ...

**Figure 3.** Gender assignment distribution results from 23.8K text outputs from six LLMs given a story-completion task with ...

**Figure 4.** Gender assignment distribution results when gender-neutral representation is not considered, highlighting a significant ...

**Figure 5.** Gender assignment distribution across narrative settings for generated stories.

**Figure 6.** Gender assignment distribution across temperatures for generated stories.

Original abstract

Gender bias in AI-generated stories is a well-documented problem. While much attention has been paid to reducing or mitigating this bias, it is not always clear whether interventions produce genuinely fairer results. To investigate this issue, we examine how large language models (LLMs) handle gender assignment in a narrative context that is popular, highly ambiguous, and also known to closely reproduce human stereotypes: stories about talking animals. We prompt six leading LLMs to complete an English-language story about seven different anthropomorphic animal characters whose gender is unstated. We additionally iterate with four different narrative settings and a range of model temperatures. Across the 23.8K stories, we find that models frequently avoid gendering the animal character in the story (19% on average) or use gender-neutral language like "it" or "its" (38.2% on average). However, when gender is assigned, there is a significant masculine bias. Feminine animal characters are virtually absent, present in just 2.2% of stories vs. 40.6% that feature masculine characters. Our findings point to a broader argument: neutrality bites. In other words, models that prioritize neutrality to address social bias may actually contribute to the erasure of marginalized perspectives and identities. We suggest that alternative strategies beyond neutrality need to be pursued, such as ones that more equally distribute social possibilities across imagined subjects.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper shows LLMs default to masculine or neutral when gendering animal characters, with feminine cases at just 2.2%, but the gender measurement method is not described.


The main thing to take from this is the reported split: masculine characters appear in 40.6% of stories while feminine ones show up in only 2.2%, so nearly every story that assigns a gender assigns a masculine one. The authors also document frequent use of neutral language or no gender at all, and they argue that neutrality efforts can erase feminine perspectives.

The experiment itself is straightforward and reasonably scaled. They generated 23.8K stories from six LLMs across seven animals, four narrative settings, and a range of temperatures. Using talking animals as the prompt domain is a reasonable choice because gender is left open and the genre often carries stereotypes. The neutrality-erasure point is a direct way to frame how bias-reduction tactics might remove certain identities rather than balance them.

The weakest part is the measurement. The abstract gives no information on how gender was extracted from the stories, whether through keywords, manual review, or another model, and there are no agreement scores or error checks. Without that, the headline percentages rest on an unexamined step. The limited set of seven animals and four settings also means we do not know how much the result depends on those specific choices or whether other animals would shift the numbers. No breakdowns by model or animal appear in the summary either.
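To make the keyword option concrete, here is a minimal sketch of a pronoun-counting classifier. It is not the authors' method, which the abstract does not describe; the pronoun lists, the tie-breaking rule, and the "ambiguous" label are all assumptions made for illustration.

```python
import re

# Assumed pronoun lists; the abstract mentions only "it" and "its" as examples
# of gender-neutral language.
MASCULINE = {"he", "him", "his", "himself"}
FEMININE = {"she", "her", "hers", "herself"}
NEUTRAL = {"it", "its", "itself", "they", "them", "their"}


def classify_story(text):
    """Label a story by its most frequent pronoun family.

    Returns "masculine", "feminine", "neutral", "ungendered" (no pronouns
    from any list), or "ambiguous" (a tie between families).
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = {
        "masculine": sum(t in MASCULINE for t in tokens),
        "feminine": sum(t in FEMININE for t in tokens),
        "neutral": sum(t in NEUTRAL for t in tokens),
    }
    if not any(counts.values()):
        return "ungendered"
    top = max(counts.values())
    winners = [label for label, c in counts.items() if c == top]
    return winners[0] if len(winners) == 1 else "ambiguous"


print(classify_story("The fox ran home. He ate his dinner."))
```

A counter like this has no notion of which character a pronoun refers to, and it treats every "it" as a reference to the animal, including uses such as "it was raining". That is exactly why agreement scores or a human-validated sample matter for the headline figures.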

This is for people working on AI bias in creative text and ethics guidelines for content generation. A reader already following LLM stereotype work would get a concrete example in a new domain, though they would need the methods details to assess the numbers. It should go to peer review because the scale and the framing raise a question worth checking, even if the current version needs clearer measurement reporting.

Referee Report

3 major / 1 minor

Summary. The paper claims that LLMs prompted to generate stories about seven anthropomorphic animals with unstated gender (across six models, four narrative settings, and varied temperatures, yielding 23.8K stories) frequently avoid explicit gendering (19% average) or use neutral pronouns (38.2% average). When gender is assigned, a strong masculine bias appears (40.6% masculine characters vs. 2.2% feminine). The authors argue this shows that neutrality-focused approaches can erase marginalized identities and call for alternative strategies.

Significance. If the gender classification procedure proves reliable, the scale of the empirical counts provides concrete evidence that LLMs default to masculine or neutral representations in ambiguous narrative contexts, supporting the broader claim that neutrality can contribute to underrepresentation. The direct generation of 23.8K stories without fitted parameters is a methodological strength for reproducibility.

major comments (3)

[Methodology] Methodology section: The procedure for classifying generated stories as masculine, feminine, neutral, or ungendered is not described, including any annotation guidelines, inter-annotator agreement scores, validation against human judgments, or handling of ambiguous cases. This directly affects the reliability of the headline 2.2% and 40.6% figures.
[Results] Results section: No per-model, per-animal, or per-setting breakdowns or variance measures are reported for the gender assignment rates. Without these, it is impossible to determine whether the masculine bias generalizes beyond the specific choice of six LLMs and seven animals or is driven by particular combinations.
[Abstract] Abstract and Results: The reported percentages lack error bars, confidence intervals, or robustness checks across temperature values, leaving the central claim of 'significant masculine bias' vulnerable to unexamined measurement and sampling choices.
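As an illustration of the interval the third comment asks for, the sketch below computes a Wilson score interval for a proportion. It assumes roughly 23,800 stories (the abstract says only "23.8K") and treats each story as an independent draw, which ignores clustering by model, animal, and setting and so likely understates the true uncertainty.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumed sample size: 23.8K rounded to 23,800 stories.
low, high = wilson_interval(0.022, 23_800)
print(f"feminine share: {low:.4f} to {high:.4f}")
```

Under these assumptions the 2.2% feminine share has an interval of roughly 2.0% to 2.4%, so sampling noise alone would not close the gap to 40.6%. Measurement error in the classification step is a separate question that this interval does not address.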

minor comments (1)

[Abstract] The abstract states the total story count but provides no details on the exact prompting template or how stories were deduplicated or filtered.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for improving the clarity and robustness of our manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.


Referee: [Methodology] Methodology section: The procedure for classifying generated stories as masculine, feminine, neutral, or ungendered is not described, including any annotation guidelines, inter-annotator agreement scores, validation against human judgments, or handling of ambiguous cases. This directly affects the reliability of the headline 2.2% and 40.6% figures.

Authors: We agree that the classification procedure must be described in detail for the results to be interpretable. The current version of the manuscript does not include this information. In the revision, we will add a dedicated subsection to the methodology that specifies the exact rules for classifying stories (based on pronoun usage and explicit gender markers), how ambiguous cases were resolved, the annotation guidelines provided to any human coders, inter-annotator agreement statistics, and the results of a validation exercise against human judgments on a held-out sample.

revision: yes

Referee: [Results] Results section: No per-model, per-animal, or per-setting breakdowns or variance measures are reported for the gender assignment rates. Without these, it is impossible to determine whether the masculine bias generalizes beyond the specific choice of six LLMs and seven animals or is driven by particular combinations.

Authors: The referee is correct that aggregate figures alone are insufficient to assess generalizability. We will revise the results section to include tables or supplementary figures with per-model, per-animal, and per-narrative-setting breakdowns of the gender assignment rates. We will also report variance measures (e.g., standard deviations across temperature settings) to allow readers to evaluate consistency.

revision: yes

Referee: [Abstract] Abstract and Results: The reported percentages lack error bars, confidence intervals, or robustness checks across temperature values, leaving the central claim of 'significant masculine bias' vulnerable to unexamined measurement and sampling choices.

Authors: We accept this criticism. The revised manuscript will include error bars or confidence intervals for the key percentages in both the abstract and results. We will also add a short robustness subsection that reports gender assignment rates across the range of temperature values used, confirming that the masculine bias pattern is stable.

revision: yes
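The first response above commits to reporting inter-annotator agreement. One common statistic for two annotators is Cohen's kappa; the sketch below is a plain-Python version with made-up labels, offered as an example of what such a check could look like rather than as the authors' procedure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels invented for illustration; not data from the paper.
annotator_1 = ["m", "m", "f", "n", "n", "u"]
annotator_2 = ["m", "m", "n", "n", "n", "u"]
print(round(cohens_kappa(annotator_1, annotator_2), 2))
```

With a category as rare as the feminine one, raw percent agreement can look high even when annotators disagree on most feminine cases, which is why a chance-corrected statistic and a per-category error table are both worth reporting.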

Circularity Check

0 steps flagged

No circularity: direct empirical counts from generated text


The paper performs an empirical study by prompting six LLMs to generate stories about seven animals in four settings, then manually or automatically classifying the 23.8K outputs for gender assignment (masculine, feminine, neutral, or avoided). The headline percentages (masculine characters in 40.6% of stories, feminine characters in 2.2%) are simple frequency counts of observed tokens and pronouns in the generated text; no equations, fitted parameters, predictions, or self-citations are used to derive these figures from the inputs. The central claim is therefore a direct report of the experimental data rather than a reduction of the data to itself by construction. The selection of models, animals, and settings is a methodological choice whose representativeness can be debated on external grounds, but it does not create a circular derivation chain.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim depends on the representativeness of the selected models and prompts plus the validity of automated or manual gender classification from generated text; no free parameters are fitted to the outcome data.

free parameters (2)

temperature values: chosen by hand to explore generation diversity; not fitted to the gender statistics.
selection of seven animal characters: chosen to ensure gender ambiguity; the specific choice could influence observed bias rates.

axioms (2)

domain assumption: LLMs produce text from which gender references can be reliably extracted and counted. Required to convert raw generations into the reported 2.2% and 40.6% figures.
domain assumption: The tested models and settings are representative of typical LLM narrative behavior. Invoked when generalizing from the 23.8K stories to broader claims about AI gender handling.
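One way to probe the representativeness assumption is to break the label rates down by model, animal, or setting and see whether the masculine skew appears in every group. The sketch below assumes each classified story is a dict with a `label` field plus grouping fields; the record layout and the toy records are hypothetical and carry no results from the paper.

```python
from collections import Counter, defaultdict


def rates_by_group(records, key):
    """Return {group: {label: share}} for the grouping field `key`."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec[key]][rec["label"]] += 1
    return {
        group: {label: c / sum(tally.values()) for label, c in tally.items()}
        for group, tally in counts.items()
    }


# Toy records invented for illustration only.
records = [
    {"model": "model_1", "label": "masculine"},
    {"model": "model_1", "label": "neutral"},
    {"model": "model_2", "label": "feminine"},
    {"model": "model_2", "label": "masculine"},
    {"model": "model_2", "label": "masculine"},
    {"model": "model_2", "label": "neutral"},
]
rates = rates_by_group(records, "model")
print(rates["model_2"]["masculine"])
```

Calling the same function with `"animal"` or `"setting"` as the key would give the other breakdowns the referee requests, provided those fields are present on each record.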




[1] [1]

Amin Abolghasemi, Leif Azzopardi, Arian Askari, Maarten de Rijke, and Suzan Verberne. 2024. Measuring Bias in a Ranked List Using Term-Based Representations. InAdvances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24–28, 2024, Proceedings, Part V(Glasgow, United Kingdom). Springer-Verlag, Berli...

work page doi:10.1007/978-3-031-56069-9_1 2024

[2] [2]

Anjali Adukia, Alex Eble, Emileigh Harrison, Hakizumwami Birali Runesha, and Teodora Szasz. 2023. What We Teach About Race and Gender: Representation in Images and Text of Children’s Books*.The Quarterly Journal of Economics138, 4 (Nov. 2023), 2225–2285. doi:10.1093/qje/qjad028

work page doi:10.1093/qje/qjad028 2023

[3] [3]

Anthropic. 2025. System Card: Claude Sonnet 4.5. https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System- Card.pdf

2025

[4] [4]

Stuart G. Baker. 1994. The Multinomial-Poisson Transformation.Journal of the Royal Statistical Society. Series D (The Statistician)43, 4 (1994), 495–504. http://www.jstor.org/stable/2348134

arXiv 1994

[5] [5]

and Gebru, Timnit and McMillan-Major, Angelina and Shmitchell, Shmargaret

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?. InProceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency(Virtual Event, Canada)(FAccT ’21). Association for Computing Machinery, New York, NY, USA, 610–623. doi:10.114...

work page doi:10.1145/3442188.3445922 2021

[6] [6]

Taylor Berry and Julia Wilkins. 2017. The Gendered Portrayal of Inanimate Characters in Children’s Books. 43, 2 (2017). FAccT ’26, June 25–28, 2026, Montreal, QC, Canada Imani Finkley, Yuanxi Li, and Melanie Walsh

2017

[7] [7]

Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (Technology) Is Power: A Critical Survey of “Bias” in NLP. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 5454...

work page doi:10.18653/v1/2020.acl-main.485 2020

[8] [8]

Yang Trista Cao and Hal Daumé. 2021. Toward Gender-Inclusive Coreference Resolution: An Analysis of Gender and Bias Throughout the Machine Learning Lifecycle.Computational Linguistics47, 3 (Nov. 2021), 615–661. doi:10.1162/coli_a_00413

work page doi:10.1162/coli_a_00413 2021

[9] [9]

Sarah Caré. 2024. Female Animal Characters in Roald Dahl’s Children’s Books: A Misogynistic Portrayal.Miscelánea: A Journal of English and American Studies69 (June 2024), 111–130. doi:10.26754/ojs_misc/mj.20248807

work page doi:10.26754/ojs_misc/mj.20248807 2024

[10]

Kennedy Casey, Kylee Novick, and Stella Lourenco. 2021. Sixty Years of Gender Representation in Children’s Books: Conditions Associated with Overrepresentation of Male versus Female Protagonists. PLOS ONE 16 (Dec. 2021), e0260566. doi:10.1371/journal.pone.0260566

[11]

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025).

[12]

Hannah Devinney, Jenny Björklund, and Henrik Björklund. 2022. Theories of “Gender” in NLP Bias Research. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’22). Association for Computing Machinery, New York, NY, USA, 2083–2102. doi:10.1145/3531146.3534627

[13]

Martha O. Dimgba, Sharon Oba, Ameeta Agrawal, and Philippe J. Giabbanelli. 2025. Mitigation of Gender and Ethnicity Bias in AI-Generated Stories through Model Explanations. arXiv:2509.04515 [cs] doi:10.48550/arXiv.2509.04515

[14]

Bufan Gao and Elisa Kreiss. 2025. Measuring Bias or Measuring the Task: Understanding the Brittle Nature of LLM Gender Biases. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (Eds.). Association for Computational Linguistics, Suzhou, Ch...

[15]

Somayeh Ghanbarzadeh, Yan Huang, Hamid Palangi, Radames Cruz Moreno, and Hamed Khanpour. 2023. Gender-tuning: Empowering Fine-tuning for Debiasing Pre-trained Language Models. In Findings of the Association for Computational Linguistics: ACL 2023, Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics, Toronto...

doi:10.18653/v1/2023.findings-acl.336

[16]

Tarleton Gillespie. 2024. Generative AI and the Politics of Visibility. Big Data & Society 11, 2 (June 2024), 20539517241252131. doi:10.1177/20539517241252131

[17]

Google. [n. d.]. Gemini Storybook — for the Stories Only You Could Imagine. https://gemini.google/overview/storybook/

[18]

Liz Grauerholz and Bernice Pescosolido. 1989. Gender Representation in Children’s Literature: 1900–1984. Gender & Society 3 (March 1989), 113–125. doi:10.1177/089124389003001008

[19]

Zhiting He, Jiayi Su, Li Chen, Tianqi Wang, and Ray Lc. 2025. ‘I Recall the Past’: Exploring How People Collaborate with Generative AI to Create Cultural Heritage Narratives. Proc. ACM Hum.-Comput. Interact. 9, 2, Article CSCW108 (May 2025), 30 pages. doi:10.1145/3711006

[20]

Thomas M. Hill and Katrina Bartow Jacobs. 2020. “The Mouse Looks Like a Boy”: Young Children’s Talk About Gender Across Human and Nonhuman Characters in Picture Books. Early Childhood Education Journal 48, 1 (Jan. 2020), 93–102. doi:10.1007/s10643-019-00969-x

[21]

Sture Holm. 1979. A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics 6, 2 (1979), 65–70. http://www.jstor.org/stable/4615733

[22]

Tamanna Hossain, Sunipa Dev, and Sameer Singh. 2023. MISGENDERED: Limits of Large Language Models in Understanding Pronouns. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics, Toronto, Canad...

doi:10.18653/v1/2023.acl-long.293

[23]

Ting-Yao Hsu, Yen-Chia Hsu, and Ting-Hao (Kenneth) Huang. 2019. On How Users Edit Computer-Generated Visual Stories. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland UK) (CHI EA ’19). Association for Computing Machinery, New York, NY, USA, 1–6. doi:10.1145/3290607.3312965

[24]

Hadas Kotek, Rikker Dockum, and David Sun. 2023. Gender bias and stereotypes in Large Language Models. In Proceedings of The ACM Collective Intelligence Conference (Delft, Netherlands) (CI ’23). Association for Computing Machinery, New York, NY, USA, 12–24. doi:10.1145/3582269.3615599

[25]

Mina Lee, Percy Liang, and Qian Yang. 2022. CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities. In CHI Conference on Human Factors in Computing Systems (CHI ’22). ACM, 1–19. doi:10.1145/3491102.3502030

[26]

Molly Lewis, Matt Cooper Borkenhagen, Ellen Converse, Gary Lupyan, and Mark S Seidenberg. 2021. What Might Books Be Teaching Young Children About Gender? (2021).

[27]

Li Lucy and David Bamman. 2021. Gender and Representation Bias in GPT-3 Generated Stories. In Proceedings of the Third Workshop on Narrative Understanding, Nader Akoury, Faeze Brahman, Snigdha Chaturvedi, Elizabeth Clark, Mohit Iyyer, and Lara J. Martin (Eds.). Association for Computational Linguistics, Virtual, 48–55. doi:10.18653/v1/2021.nuse-1.5

[28]

Janice McCabe, Emily Fairchild, Liz Grauerholz, Bernice Pescosolido, and Daniel Tope. 2011. Gender in Twentieth-Century Children’s Books. Gender & Society 25 (March 2011), 197–226. doi:10.1177/0891243211398358

[29]

Jennifer Mickel, Maria De-Arteaga, Liu Leqi, and Kevin Tian. 2026. More of the Same: Persistent Representational Harms Under Increased Representation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=R9k13fTGP0

[30]

MistralAI. 2025. Mistral Medium 3.1 - Mistral AI. https://docs.mistral.ai/models/mistral-medium-3-1-25-08

[31]

Joao Neves, Inês Costa, Joao Oliveira, Bruno Silva, and Joana Maia. 2023. Can Gender Nouns Influence the Stereotypes of Animals? Animals: an Open Access Journal from MDPI 13, 16 (Aug. 2023), 2604. doi:10.3390/ani13162604

[32]

Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, S...

[33]

Olmo 3. arXiv:2512.13961 [cs.CL] https://arxiv.org/abs/2512.13961

[34]

OpenAI. 2024. GPT-4o System Card. https://cdn.openai.com/gpt-4o-system-card.pdf

[35]

OpenAI. 2025. GPT-5 System Card. https://cdn.openai.com/gpt-5-system-card.pdf

[36]

Joanne O’Sullivan. 2025. Google Launches Personalized Gemini Storybook App to Industry Concern. https://www.publishersweekly.com/pw/by-topic/childrens/childrens-industry-news/article/98452-google-launches-personalized-gemini-storybook-app-to-industry-concern.html

[37]

Shon Otmazgin, Arie Cattan, and Yoav Goldberg. 2022. F-coref: Fast, Accurate and Easy to Use Coreference Resolution. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: System Demonstrations, Wray Buntine and Maria Liaka...

doi:10.18653/v1/2022.aacl-demo.6

[38]

Amifa Raj and Michael D. Ekstrand. 2022. Fire Dragon and Unicorn Princess; Gender Stereotypes and Children’s Products in Search Engine Responses. arXiv. doi:10.48550/ARXIV.2206.13747 Version Number: 1

[39]

Ammaar Reshi. 2022. Alice and Sparkle. Independently published. Text generated using ChatGPT; Illustrations generated using Midjourney.

[40]

Donya Rooein, Vilém Zouhar, Debora Nozza, and Dirk Hovy. 2025. Biased Tales: Cultural and Topic Bias in Generating Children’s Stories. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (Eds.). Association for Computational Linguistics, Su...

[41]

Shirin Seyedsalehi. 2025. Mitigating Gender Bias in Information Retrieval Systems. In Advances in Information Retrieval, Claudia Hauff, Craig Macdonald, Dietmar Jannach, Gabriella Kazai, Franco Maria Nardini, Fabio Pinelli, Fabrizio Silvestri, and Nicola Tonellotto (Eds.). Vol. 15576. Springer Nature Switzerland, Cham, 227–232. doi:10.1007/978-3-031-88720-...

doi:10.1007/978-3-031-88720-8_36

[42]

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. The Woman Worked as a Babysitter: On Biases in Language Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent Ng, ...

doi:10.18653/v1/d19-1339

[43]

Evan Shieh, Faye Marie Vassel, Cassidy R. Sugimoto, and Thema Monroe-White. 2025. Laissez-Faire Harms: Algorithmic Biases in Generative Language Models (Extended Abstract). Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society 8, 3 (Oct. 2025), 2373–2374. doi:10.1609/aies.v8i3.36722

[44]

Laura Spillner. 2024. Unexpected Gender Stereotypes in AI-generated Stories: Hairdressers Are Female, but so Are Doctors. In Proceedings of the Text2Story’24 Workshop. Glasgow, Scotland. https://ceur-ws.org/Vol-3671/paper10.pdf

[45]

Goya van Boven, Yupei Du, and Dong Nguyen. 2024. Transforming Dutch: Debiasing Dutch Coreference Resolution Systems for Non-binary Pronouns. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’24). Association for Computing Machinery, New York, NY, USA, 2470–2483. doi:10.1145/3630106.3659049

[46]

Melanie Walsh, Russell Samora, Michelle Pera-McGhee, and Jan Diehm. 2025. Bears Will Be Boys: A data analysis of animal gender in children’s books. The Pudding (July 2025). https://pudding.cool/2025/07/kids-books/

[47]

Jules Watson, Xi Wang, Raymond Liu, Suzanne Stevenson, and Barend Beekhuizen. 2025. Analyzing values about gendered language reform in LLMs’ revisions. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (Eds.). Association for Computationa...

doi:10.18653/v1/2025.emnlp-main.1277

[48]

Jennie Yabroff. 2016. Why Are There so Few Girls in Children’s Books? The Washington Post (Jan. 2016).

[49]

Tao Zhang, Ziqian Zeng, Yuxiang Xiao, Huiping Zhuang, Cen Chen, James R. Foulds, and Shimei Pan. 2025. GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Sh...

doi:10.18653/v1/2025.acl-long.553