BIASEDTALES-ML: A Multilingual Dataset for Analyzing Narrative Attribute Distributions in LLM-Generated Stories
Pith reviewed 2026-05-10 06:49 UTC · model grok-4.3
The pith
LLM-generated children's stories show different narrative attribute distributions across languages, and English patterns do not generalize.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present BiasedTales-ML, a large-scale parallel corpus of approximately 350,000 children's stories generated across eight typologically and culturally diverse languages using a full-permutation prompting design. We propose a structured generator-extractor pipeline and a multi-dimensional distributional analysis framework to examine how narrative attributes vary across languages, models, and social conditions. Our analysis reveals substantial cross-lingual variability in narrative generation patterns, indicating that distributions observed in English do not always exhibit similar characteristics in other languages, particularly in lower-resource settings. At the narrative level, we identify recurring structural patterns involving character roles, settings, and thematic emphasis, which manifest differently across linguistic contexts.
What carries the argument
The full-permutation prompting design paired with a generator-extractor pipeline that produces and analyzes the multilingual story corpus for attribute distributions.
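A full-permutation design of this kind amounts to enumerating the Cartesian product of controlled prompt factors, so that every factor combination appears exactly once per model. The factor sets, template wording, and language list below are illustrative assumptions for the sketch, not the paper's actual configuration.

```python
from itertools import product

# Hypothetical prompt factors; the paper's real factor sets and template
# are not given in this summary, so these values are illustrative only.
languages = ["English", "German", "Arabic", "Swahili"]
protagonist_genders = ["girl", "boy"]
themes = ["friendship", "courage"]

TEMPLATE = "Write a children's story in {lang} about a {gender} and the theme of {theme}."

def full_permutation_prompts():
    """Yield one prompt per combination of factors (the full permutation)."""
    for lang, gender, theme in product(languages, protagonist_genders, themes):
        yield {
            "language": lang,
            "gender": gender,
            "theme": theme,
            "prompt": TEMPLATE.format(lang=lang, gender=gender, theme=theme),
        }

prompts = list(full_permutation_prompts())
# 4 languages x 2 genders x 2 themes = 16 prompts in this toy configuration
```

Each prompt dict would then be sent to every generator model, which is what makes the resulting corpus parallel across languages and conditions.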
If this is right
- Distributions of narrative attributes in English do not match those in other languages.
- Lower-resource languages display distinct characteristics in generated stories.
- Recurring structural patterns in character roles, settings, and themes manifest differently by linguistic context.
- English-centric evaluation is insufficient for characterizing socially grounded narrative generation.
Where Pith is reading between the lines
- Alignment methods tuned on English data may leave language-specific biases unaddressed in non-English outputs.
- The released dataset allows direct tests of whether particular models or prompt strategies reduce cross-language differences.
- The approach could extend to measuring how cultural context beyond language affects LLM story generation.
Load-bearing premise
The full-permutation prompting design combined with the generator-extractor pipeline produces comparable and unbiased attribute extractions across typologically diverse languages, free of language-specific artifacts or influences from the models' training data.
What would settle it
A replication using the same dataset but alternative extraction methods or different LLMs that finds identical attribute distributions across all eight languages would falsify the claim of substantial cross-lingual variability.
Figures
Original abstract
Large Language Models (LLMs) are increasingly used to generate narrative content, including children's stories, which play an important role in social and cultural learning. Despite growing interest in AI safety and alignment, most existing evaluations focus primarily on English, leaving the cross-lingual generalization of aligned behavior underexplored. In this work, we introduce BiasedTales-ML, a large-scale parallel corpus of approximately 350,000 children's stories generated across eight typologically and culturally diverse languages using a full-permutation prompting design. We propose a structured generator-extractor pipeline and a multi-dimensional distributional analysis framework to examine how narrative attributes vary across languages, models, and social conditions. Our analysis reveals substantial cross-lingual variability in narrative generation patterns, indicating that distributions observed in English do not always exhibit similar characteristics in other languages, particularly in lower-resource settings. At the narrative level, we identify recurring structural patterns involving character roles, settings, and thematic emphasis, which manifest differently across linguistic contexts. These findings highlight the limitations of English-centric evaluation for characterizing socially grounded narrative generation in multilingual settings. We release the dataset, code, and an interactive visualization tool to support future research on multilingual narrative analysis and evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BiasedTales-ML, a parallel corpus of approximately 350,000 LLM-generated children's stories across eight typologically diverse languages, constructed via a full-permutation prompting design. It describes a generator-extractor pipeline to extract narrative attributes (character roles, settings, themes) and applies a multi-dimensional distributional analysis framework, revealing substantial cross-lingual variability in generation patterns and arguing that English-centric distributions do not generalize, especially in lower-resource languages. The manuscript releases the dataset, code, and an interactive visualization tool.
Significance. If the variability findings are robust, the work would be significant for multilingual NLP and AI safety research by providing evidence against assuming English narrative distributions generalize and by supplying a large-scale resource for studying cultural and linguistic biases in generated stories. The public release of the dataset, code, and visualization tool is a clear strength that supports reproducibility and follow-on work.
major comments (2)
- The headline claim of substantial cross-lingual variability in narrative attribute distributions (abstract and analysis framework) rests on the generator-extractor pipeline producing comparable, unbiased extractions across all eight languages. No per-language validation metrics, human evaluation, inter-annotator agreement, or error analysis for the extractor is described, leaving open the possibility that observed shifts in lower-resource languages reflect LLM extractor artifacts rather than generation differences.
- The abstract states that the analysis 'reveals substantial cross-lingual variability' and identifies 'recurring structural patterns' but provides no quantitative results, statistical tests, effect sizes, or tables summarizing the distributional differences. This weakens support for the central claim that English observations 'do not always exhibit similar characteristics' in other languages.
minor comments (1)
- The abstract refers to 'approximately 350,000' stories but does not break down the count by language, model, or condition; adding this detail would clarify the balance and scale of the parallel corpus.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript introducing BiasedTales-ML. We address each major comment below in detail, indicating planned revisions to strengthen the work while maintaining fidelity to our original contributions and resource constraints.
Point-by-point responses
- Referee: The headline claim of substantial cross-lingual variability in narrative attribute distributions (abstract and analysis framework) rests on the generator-extractor pipeline producing comparable, unbiased extractions across all eight languages. No per-language validation metrics, human evaluation, inter-annotator agreement, or error analysis for the extractor is described, leaving open the possibility that observed shifts in lower-resource languages reflect LLM extractor artifacts rather than generation differences.
Authors: We appreciate the referee's emphasis on methodological rigor for the extractor. The manuscript describes a structured generator-extractor pipeline using consistent prompting across languages to promote comparability, but we acknowledge that it does not report per-language validation metrics, human evaluation, or inter-annotator agreement. In the revision, we will add a dedicated error-analysis subsection that samples stories from each of the eight languages, reports extraction-consistency metrics where feasible, and discusses potential artifacts. This will provide additional evidence that the observed distributional shifts reflect generation patterns rather than extractor issues. Full-scale human annotation across the entire corpus remains impractical given the scale (approximately 350,000 stories), but the added analysis will address the core concern.
revision: partial
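One standard form for the promised extraction-consistency check is Cohen's kappa between the LLM extractor and a human annotator on a per-language sample; the Landis and Koch agreement bands already cited in the reference graph give a conventional interpretation scale. The labels below are invented for illustration, and this is a sketch of the metric, not the authors' actual validation protocol.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' categorical labels on the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent per-annotator label marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical extractor-vs-human character-role labels for one sampled language:
llm_labels   = ["hero", "helper", "hero", "villain", "hero", "helper"]
human_labels = ["hero", "helper", "hero", "helper",  "hero", "helper"]
kappa = cohens_kappa(llm_labels, human_labels)  # ~0.71, "substantial" agreement
```

Reporting such a kappa per language, rather than pooled, is what would distinguish extractor artifacts in lower-resource languages from genuine generation differences.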
- Referee: The abstract states that the analysis 'reveals substantial cross-lingual variability' and identifies 'recurring structural patterns' but provides no quantitative results, statistical tests, effect sizes, or tables summarizing the distributional differences. This weakens support for the central claim that English observations 'do not always exhibit similar characteristics' in other languages.
Authors: The abstract is intentionally concise to summarize the contribution and cannot accommodate full quantitative details. The main manuscript presents the multi-dimensional distributional analysis with figures, comparative tables, and descriptions of variability across languages, models, and conditions. To directly address this point, we will revise the abstract to include a brief mention of key quantitative highlights (e.g., specific percentage differences in character role distributions between English and lower-resource languages, along with reference to statistical comparisons). This will better foreground the evidence for the central claim without exceeding typical abstract length limits.
revision: yes
Circularity Check
No significant circularity: empirical distributions from new multilingual corpus
full rationale
The paper constructs a new parallel corpus of ~350k LLM-generated children's stories across eight languages via full-permutation prompting, applies an LLM-based generator-extractor pipeline to label narrative attributes (roles, settings, themes), and reports observed distributional differences. All central claims about cross-lingual variability rest on direct computation of empirical frequencies and patterns in this freshly generated data rather than any parameter fitting, self-citation chains, uniqueness theorems, or renamings that would reduce the reported results to the inputs by construction. No equations, ansatzes, or load-bearing citations to prior author work appear in the derivation; the analysis framework simply tabulates and compares attribute counts across language/model conditions on the released dataset.
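The "tabulates and compares" step admits a concrete form: Jensen-Shannon divergence (Lin's measure, cited in the reference graph) between per-language attribute count tables. The character-role counts below are invented for illustration; this sketches one plausible comparison, not the paper's reported analysis.

```python
import math
from collections import Counter

def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence (base 2, in bits) between two count tables.

    Returns 0 for identical distributions and at most 1 for disjoint ones.
    """
    keys = set(p_counts) | set(q_counts)
    p_total, q_total = sum(p_counts.values()), sum(q_counts.values())
    p = {k: p_counts.get(k, 0) / p_total for k in keys}
    q = {k: q_counts.get(k, 0) / q_total for k in keys}
    m = {k: (p[k] + q[k]) / 2 for k in keys}  # mixture distribution

    def kl(a, b):
        # KL(a || b); zero-probability terms contribute nothing.
        return sum(a[k] * math.log2(a[k] / b[k]) for k in keys if a[k] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Invented per-language character-role counts for illustration:
english = Counter({"hero": 60, "helper": 30, "villain": 10})
swahili = Counter({"hero": 40, "helper": 20, "villain": 40})
jsd = js_divergence(english, swahili)  # larger values = more divergent attribute mixes
```

Computing this pairwise against the English distribution for each attribute dimension is one direct way to quantify the cross-lingual variability the review describes.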
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Systematic prompting of LLMs can surface stable narrative attribute distributions that reflect model behavior across languages.
Reference graph
Works this paper leans on
- [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and others. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- [2] BedtimeStory.ai. 2023. AI Powered Story Creator. https://bedtimestory.ai
- [3] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610--623.
- [4] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of "bias" in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454--5476, Online. https://doi.org/10.18653/v1/2020.acl-main.485
- [5] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183--186.
- [6] Victoria Cooper. 2014. Children's developing identity. In A Critical Companion to Early Childhood, pages 281--296.
- [8] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. The Llama 3... https://arxiv.org/abs/2407.21783
- [10] Nicole Kobie. 2023. AI Is Telling Bedtime Stories to Your Kids Now. Wired. https://www.wired.com/story/bluey-gpts-bedtime-stories-artificial-intelligence-copyright/
- [11] J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1):159--174.
- [12] Jianhua Lin. 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145--151.
- [13] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634.
- [14] Li Lucy and David Bamman. 2021. Gender and representation bias in GPT-3 generated stories. In Proceedings of the Third Workshop on Narrative Understanding, pages 48--55.
- [15] Kristian Lum, Jacy Reese Anthis, Kevin Robinson, Chirag Nagpal, and Alexander Nicholas D'Amour. 2025. Bias in language models: Beyond trick tests and toward RUTEd evaluation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria. https://aclanthology.org/2025.acl-long.7/
- [16] Yingfeng Luo, Ziqiang Xu, Yuxuan Ouyang, Murun Yang, Dingyang Lin, Kaiyan Chang, Tong Zheng, Bei Li, Peinan Feng, Quan Du, Tong Xiao, and Jingbo Zhu. 2025. Beyond English: Toward inclusive and scalable multilingual machine translation with LLMs. CoRR, abs/2511.07003. https://doi.org/10.48550/ARXIV.2511.07003
- [17] Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372--403.
- [19] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and others. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744.
- [22] Donya Rooein, Vilém Zouhar, Debora Nozza, and Dirk Hovy. 2025. Biased tales: Cultural and topic bias in generating children's stories. CoRR, abs/2509.07908. https://doi.org/10.48550/ARXIV.2509.07908
- [24] Spriha Srivastava. 2023. I use ChatGPT to write stories for my 5-year-old. It's fun, innovative, and makes bedtime less stressful. Business Insider. https://www.businessinsider.com/i-use-chatgpt-write-bedtime-stories-my-5-year-old-2023-4
- [25] Qwen Team. 2025. Qwen3 technical report. Preprint, arXiv:2505.09388. https://arxiv.org/abs/2505.09388
- [26] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36:80079--80110.
- [27] Zheng-Xin Yong, Beyza Ermis, Marzieh Fadaee, Stephen Bach, and Julia Kreutzer. 2025. The state of multilingual LLM safety research: From measuring the language gap to mitigating it. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15856--15871.
- [28] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, and others. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595--46623.