Not What, But How: A Framework for Auditing LLM Responses across Positioning, Generalization, Anthromorphism, and Maxims

Alice Oh; Haneul Yoo; Isabelle Augenstein; Sarah Masud; Siddhesh Milind Pawar

arxiv: 2606.02493 · v2 · pith:ZAIRVYC2new · submitted 2026-06-01 · 💻 cs.CL

Siddhesh Milind Pawar, Sarah Masud, Haneul Yoo, Alice Oh, Isabelle Augenstein

Pith reviewed 2026-06-28 14:18 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM auditing · response framing · cultural positioning · anthropomorphism · conversational maxims · generalization · subjective questions · communicative audit

The pith

FRANZ audits LLM responses to subjective questions on four framing dimensions and finds insider positioning coupled with anthropomorphism at rates that vary by country.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FRANZ as an automated way to examine not only what LLMs say in response to cultural or opinion questions but how they say it. It measures four aspects of communication: whether the model speaks from an insider perspective, uses sweeping generalizations, employs human-like cues, and follows basic conversational rules. The work applies this to hundreds of thousands of real user questions drawn from different countries, scoring outputs from several models. Results show clear differences across models and, more notably, a consistent positive link between insider positioning and anthropomorphism whose strength changes with the country of the question source.

Core claim

FRANZ is an automated framework that scores LLM responses along cultural positioning, use of generalizing language, anthropomorphic cues, and adherence to conversational maxims. When applied to outputs from three open-weight models on the SQUARE corpus of 376k questions from 57 subreddits mapped to seven countries and nineteen categories, the framework detects statistically significant differences in how often each characteristic appears and identifies a positive coupling between insider positioning and anthropomorphism whose strength varies by country.

What carries the argument

FRANZ, the automated scoring framework that quantifies the four communicative dimensions on responses to the SQUARE question corpus.
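
To make the four dimensions concrete, here is a minimal sketch of how one scored response could be stored. The class name `ResponseAudit`, its field names, and the rule that any single cue counts as anthropomorphism are illustrative assumptions, not structures taken from the paper; only the 1/0 presence coding and the dimension names come from the text on this page.

```python
from dataclasses import dataclass, field

# Hypothetical record layout; names are illustrative, not from the paper.
MAXIMS = ("quality", "quantity", "relation", "manner")


@dataclass
class ResponseAudit:
    model: str         # which LLM produced the response
    country: str       # country the source subreddit is mapped to
    category: str      # one of the 19 question categories
    insider: int       # 1 if insider positioning is present, else 0
    generalizing: int  # 1 if generalizing language is present, else 0
    cues: dict = field(default_factory=dict)    # anthropomorphic cue -> 0/1
    maxims: dict = field(default_factory=dict)  # maxim -> 1 if adhered to

    def anthropomorphic(self) -> int:
        """Assumption: a response counts as anthropomorphic if any cue fires."""
        return int(any(self.cues.values()))

    def violated_maxims(self) -> list:
        """Maxims explicitly labelled 0; missing maxims are treated as adhered."""
        return [m for m in MAXIMS if self.maxims.get(m, 1) == 0]


audit = ResponseAudit(
    model="model-a",
    country="India",
    category="Food and drinks",
    insider=1,
    generalizing=0,
    cues={"emotion": 0, "validation": 1, "empathy": 0},
    maxims={"quality": 1, "quantity": 0, "relation": 1, "manner": 1},
)
print(audit.anthropomorphic(), audit.violated_maxims())
```

Keeping each cue and maxim as its own binary field preserves the per-cue detail that the figure captions report separately (emotion, validation, empathy).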

If this is right

Models differ reliably in how frequently they adopt each of the four response characteristics.
Insider positioning and anthropomorphism appear together more often than expected by chance.
The strength of the positioning-anthropomorphism link is not uniform but depends on the country associated with the question.
Multi-dimensional auditing can surface framing patterns that single-metric checks miss.
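
The coupling claims in the list above can be illustrated with a raw difference in conditional probabilities per country. This is a simplified stand-in: the figure captions refer to an Eq. 1 and to GEE fits that are not reproduced on this page, so the function names, the row layout, and the toy data below are all assumptions.

```python
from collections import defaultdict


def coupling_delta(rows, cause, effect):
    """P(effect=1 | cause=1) - P(effect=1 | cause=0) over binary-labelled rows."""
    hits = {0: 0, 1: 0}
    totals = {0: 0, 1: 0}
    for r in rows:
        totals[r[cause]] += 1
        hits[r[cause]] += r[effect]
    if totals[0] == 0 or totals[1] == 0:
        return None  # undefined when one cause group is empty
    return hits[1] / totals[1] - hits[0] / totals[0]


def delta_by_country(rows, cause="insider", effect="anthro"):
    """Group rows by country and compute the raw coupling in each group."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["country"]].append(r)
    return {c: coupling_delta(g, cause, effect) for c, g in groups.items()}


# Toy data, invented for illustration only.
rows = [
    {"country": "A", "insider": 1, "anthro": 1},
    {"country": "A", "insider": 1, "anthro": 1},
    {"country": "A", "insider": 0, "anthro": 0},
    {"country": "A", "insider": 0, "anthro": 1},
    {"country": "B", "insider": 1, "anthro": 1},
    {"country": "B", "insider": 0, "anthro": 1},
]
result = delta_by_country(rows)
print(result)
```

A positive value means the effect is more frequent when the cause is present; a regression model would be needed to account for clustering and to test significance.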

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The country-specific coupling pattern could be used to flag when a model’s framing diverges from expected local norms.
Extending the same audit to additional models or languages would test whether the observed coupling is architecture-dependent or data-dependent.
Prompt interventions that reduce one characteristic might unintentionally affect the other because of their measured link.

Load-bearing premise

The automated classifiers inside FRANZ correctly detect and measure the four target response characteristics without large measurement error.

What would settle it

A side-by-side human annotation of several thousand scored responses that shows low agreement with FRANZ labels on the presence or intensity of insider positioning or anthropomorphism.
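
A minimal sketch of how such a side-by-side comparison could be summarized, assuming binary human labels and binary FRANZ labels for the same responses. The function name `agreement_report` and the toy labels are hypothetical; Cohen's kappa is used here as one common agreement statistic, not as the metric the paper reports.

```python
def agreement_report(human, predicted):
    """Accuracy, precision, recall, and Cohen's kappa for paired binary labels."""
    if len(human) != len(predicted) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(human, predicted))
    tp = sum(1 for h, p in pairs if h == 1 and p == 1)
    tn = sum(1 for h, p in pairs if h == 0 and p == 0)
    fp = sum(1 for h, p in pairs if h == 0 and p == 1)
    fn = sum(1 for h, p in pairs if h == 1 and p == 0)
    n = len(pairs)
    observed = (tp + tn) / n
    # Chance agreement from the two raters' marginal label rates.
    expected = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return {
        "accuracy": observed,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "kappa": kappa,
    }


# Toy labels, invented for illustration only.
human_labels = [1, 1, 0, 0, 1, 0]
franz_labels = [1, 0, 0, 1, 1, 0]
report = agreement_report(human_labels, franz_labels)
print(report)
```

Kappa corrects raw accuracy for agreement expected by chance, which matters when a characteristic is rare or very common.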

Figures

Figures reproduced from arXiv: 2606.02493 by Alice Oh, Haneul Yoo, Isabelle Augenstein, Sarah Masud, Siddhesh Milind Pawar.

**Figure 1.** Overview of the pipeline. LLMs generate responses to subjective questions in SQUARE (a–c; Section 4); FRANZ then scores each response along four characteristics, viz. cultural positioning, generalizing language, anthropomorphism, and maxim adherence (d–e; Section 3). … We employ FRANZ to evaluate responses for SQUARE generated by Llama-3.1-8B-it (Llama), Gemma-3-12b-it (…

**Figure 2.** Percentage of responses by 3 LLMs on SQUARE judged by FRANZ to exhibit (a) insider positioning, (b) use of generalizing language, and (c) different anthropomorphic cues (Section 6.1). … where 1/0 denote the presence/absence of a response characteristic. We fit one GEE per country-category subset of the cause-effect pair. RQ3 reuses the statistical techniques from RQ1 and RQ2. For all RQs, …

**Figure 3.** Computed via Eq. 1, each cell reports the number of combinations across top categories and LLMs per country where the presence of a cause significantly increases the probability of the effect (blue), decreases it (red), or has no impact (∆ ≈ 0, grey). Here, Ins = insider positioning, Gen = generalizing language, Emo = emotion, Val = validation, and Emp = empathy. Countries are listed by total count across 19 categories …

**Figure 4.** Percentage of violations (non-adherence) per maxim, aggregated by country for …

**Figure 5.** Prompt for zero-shot question categorization into 19 categories.

**Figure 6.** Prompt for the insider/outsider positioning judge.

**Figure 7.** Prompt template for the anthropomorphism judge. The scaffold is constant; …

**Figure 8.** Prompt for the generalizing-language judge.

**Figure 9.** Shared prompt template for the four maxim judges. Only …

**Figure 10.** Overall, 3 of 11 CLIcK (Kim et al., 2024) categories, 1 of 12 CaLMQA (Arora et al., 2025) categories, and 10 of 21 Thompson (Thompson et al., 2020) categories do not map to our use case.

**Figure 11.** Percentage of question category in SQUARE mapped to respective subreddit countries. [Heatmap residue from an adjacent figure: labels pair each maxim (Qual, Quant, Rel, Man) with Ins, Gen, Emo, …; countries listed are India (9), China (9), Philippines (9), USA (9), Korea (9), Russia (9), and Turkey (3); cell values omitted.]

**Figure 12.** C→E co-occurrence (Eq. 1). Each cell reports the number of combinations across top categories and LLMs (3 ∗ 3) per country where the presence of a cause significantly increases the probability of the effect (blue), decreases it (red), or has no impact (∆ ≈ 0, grey). Here, Ins = insider positioning, Gen = generalizing language. Anthropomorphic cues include Emo = emotion, Val = validation, and Emp = empathy. The maxims are …

**Figure 13.** India: Computed via Eq. 1, the estimated co-occurrence effects of: (a) cultural positioning on anthropomorphism, (b) anthropomorphism on cultural positioning, and (c) maxim of manner on anthropomorphism. Each panel shows the change in probability across models for the top-3 categories. Solid lines indicate significant Wald effects, † denotes cross-model differences FDR-corrected, both at p < 0.05 (Sections 6.2 …

**Figure 14.** China: Computed via Eq. 1, the estimated co-occurrence effects of: (a) cultural positioning on anthropomorphism, (b) anthropomorphism on cultural positioning, and (c) maxim of manner on anthropomorphism. Each panel shows the change in probability (∆) across models for the top-3 categories. Solid lines indicate significant Wald effects, † denotes cross-model differences FDR-corrected, both at p < 0.05 (Sections 6…

**Figure 15.** Philippines: Computed via Eq. 1, the estimated co-occurrence effects of: (a) cultural positioning on anthropomorphism, (b) anthropomorphism on cultural positioning, and (c) maxim of manner on anthropomorphism. Each panel shows the change in probability (∆) across models for the top-3 categories. Solid lines indicate significant Wald effects, † denotes cross-model differences FDR-corrected, both at p < 0.05 (Sect…

**Figure 16.** USA: Computed via Eq. 1, the estimated co-occurrence effects of: (a) cultural positioning on anthropomorphism, (b) anthropomorphism on cultural positioning, and (c) maxim of manner on anthropomorphism. Each panel shows the change in probability (∆) across models for the top-3 categories. Solid lines indicate significant Wald effects, † denotes cross-model differences FDR-corrected, both at p < 0.05 (Sections 6.…

**Figure 17.** Korea: Computed via Eq. 1, the estimated co-occurrence effects of: (a) cultural positioning on anthropomorphism, (b) anthropomorphism on cultural positioning, and (c) maxim of manner on anthropomorphism. Each panel shows the change in probability (∆) across models for the top-3 categories. Solid lines indicate significant Wald effects, † denotes cross-model differences FDR-corrected, both at p < 0.05 (Sections 6…

**Figure 18.** Russia: Computed via Eq. 1, the estimated co-occurrence effects of: (a) cultural positioning on anthropomorphism, (b) anthropomorphism on cultural positioning, and (c) maxim of manner on anthropomorphism. Each panel shows the change in probability (∆) across models for the top-3 categories. Solid lines indicate significant Wald effects, † denotes cross-model differences FDR-corrected, both at p < 0.05 (Sections …

**Figure 19.** Turkey: Computed via Eq. 1, the estimated co-occurrence effects of: (a) cultural positioning on anthropomorphism, (b) anthropomorphism on cultural positioning, and (c) maxim of manner on anthropomorphism. Each panel shows the change in probability (∆) across models for the top-3 categories. Solid lines indicate significant Wald effects, † denotes cross-model differences FDR-corrected, both at p < 0.05 (Sections …

Original abstract

Large language models (LLMs) are being increasingly used to answer subjective, information-seeking questions, where users are sensitive to how responses are communicated, not just whether the answers are correct. Existing LLM evaluations for subjective cultural queries largely focus on factual correctness, ignoring how the response is framed. To this end, we introduce FRANZ, an automated FRAmework for respoNse characteriZation to conduct communicative audit of LLM responses along four dimensions: cultural positioning, use of generalizing language, anthropomorphic cues, and adherence to conversational maxims. To enable this evaluation, we contribute SQUARE - a corpus of 376k subjective questions sourced from 57 subreddits, and mapped to 7 countries and 19 question categories. We demonstrate FRANZ's applicability by scoring responses from three open-weight LLMs. We observe that LLMs show statistically significant differences in the frequency with which they employ each response characteristic. Unlike single-dimensional audits, FRANZ reveals that insider positioning and anthropomorphism are positively coupled, with the degree of coupling varying by country, providing a diagnostic lens for identifying framing divergences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

FRANZ and SQUARE give a workable multi-axis audit of LLM response style, but the automated scorers have no reported human validation, so the coupling results stay provisional.

The paper's core move is to treat response framing as something that can be scored along four dimensions at once—cultural positioning, generalization, anthropomorphism, and maxims—rather than one at a time. They back this with SQUARE, a 376k-question corpus drawn from 57 subreddits and mapped to countries and categories. That combination is new. They then run three open-weight models and report statistically significant differences in how often each characteristic appears, plus a positive link between insider positioning and anthropomorphism that varies by country.

The work is straightforward on the data side: real user questions, scale, and a clear application to existing models. The coupling observation is the part that goes beyond single-dimension audits.

The soft spot is measurement. The abstract supplies no accuracy figures, inter-annotator agreement, or error analysis for any of the four automatic detectors. If the same surface cues trigger both the positioning and anthropomorphism scores, the reported correlation could be produced by the detector rather than the models. Without those checks, the country-level variation is hard to interpret.
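
One way to probe this concern is to stratify responses by a surface cue that both detectors might react to and recompute the coupling inside each stratum. The sketch below uses first-person pronouns as that cue, since the annotation guidelines extracted further down list "I," "we," and "us" as insider markers. The regex, function names, and toy rows are assumptions made for illustration, and the raw probability difference is a stand-in for the paper's own estimator.

```python
import re

# Crude cue detector (assumption); with IGNORECASE it would also match "US".
FIRST_PERSON = re.compile(r"\b(i|we|us)\b", re.IGNORECASE)


def has_first_person(text):
    return bool(FIRST_PERSON.search(text))


def delta(rows, cause="insider", effect="anthro"):
    """P(effect=1 | cause=1) - P(effect=1 | cause=0); None if a group is empty."""
    hits = {0: 0, 1: 0}
    totals = {0: 0, 1: 0}
    for r in rows:
        totals[r[cause]] += 1
        hits[r[cause]] += r[effect]
    if totals[0] == 0 or totals[1] == 0:
        return None
    return hits[1] / totals[1] - hits[0] / totals[0]


def stratified_delta(rows):
    """Recompute the coupling separately for rows with and without the cue."""
    strata = {True: [], False: []}
    for r in rows:
        strata[has_first_person(r["text"])].append(r)
    return {cue: delta(group) for cue, group in strata.items()}


# Toy rows, invented so that the pooled coupling is driven entirely by the cue.
rows = [
    {"text": "We usually cook this at home.", "insider": 1, "anthro": 1},
    {"text": "I grew up eating this.", "insider": 1, "anthro": 1},
    {"text": "It reminds us of family.", "insider": 1, "anthro": 1},
    {"text": "I hear it is common there.", "insider": 0, "anthro": 1},
    {"text": "Locals cook this at home.", "insider": 1, "anthro": 0},
    {"text": "It is common there.", "insider": 0, "anthro": 0},
    {"text": "People there eat rice daily.", "insider": 0, "anthro": 0},
    {"text": "The dish is popular.", "insider": 0, "anthro": 0},
]
pooled = delta(rows)
by_cue = stratified_delta(rows)
print(pooled, by_cue)
```

In this toy case the pooled coupling is positive while it vanishes within each stratum, which is the pattern that would point to a shared surface cue rather than a genuine link.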

This is for people already working on LLM evaluation for subjective queries who want tools that track style instead of just correctness. The corpus and the four-axis framing will be useful to that group even if the current scorers need tightening.

It deserves peer review. The artifacts are large enough and the question is well-posed, but the methods section will need direct examination on how the classifiers were built and tested.

Referee Report

2 major / 1 minor

Summary. The paper introduces FRANZ, an automated framework for auditing LLM responses to subjective questions along four dimensions (cultural positioning, generalization, anthropomorphism, and conversational maxims). It contributes the SQUARE corpus of 376k questions from 57 subreddits mapped to 7 countries and 19 categories. Applying FRANZ to three open-weight LLMs yields statistically significant differences in characteristic frequencies and reveals a positive coupling between insider positioning and anthropomorphism whose strength varies by country.

Significance. If the automated measurements prove reliable, FRANZ supplies a multi-dimensional diagnostic for LLM framing that single-axis audits miss, and the SQUARE corpus offers a reusable resource for culturally grounded evaluation. The reported coupling provides a concrete, falsifiable pattern that could guide alignment work on subjective queries.

major comments (2)

FRANZ framework (methods section): No accuracy, precision, recall, or inter-annotator agreement figures are reported for the automated classifiers that detect the four dimensions. Because the headline coupling result is produced entirely by these scorers, the absence of human validation leaves open the possibility that shared prompt or model biases mechanically induce the observed correlation.

SQUARE corpus construction: The abstract states that questions are 'mapped to 7 countries and 19 question categories,' yet provides no validation protocol, agreement metrics, or error analysis for the mapping procedure. Country-specific variation in the coupling therefore rests on an unverified assignment step.

minor comments (1)

The abstract asserts 'statistically significant differences' without naming the tests, correction method, or effect-size thresholds employed.
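
For context on what a named correction method would look like, here is a minimal sketch of the Benjamini-Hochberg step-up procedure. The figure captions mention FDR correction at p < 0.05 but do not say which procedure was used, so choosing Benjamini-Hochberg is an assumption, as are the function name and the toy p-values.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a reject/keep flag per hypothesis under the BH step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank k whose sorted p-value is at most (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    # Reject every hypothesis ranked at or below that k.
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Toy p-values, invented for illustration only.
flags_a = benjamini_hochberg([0.001, 0.04, 0.03, 0.20])
flags_b = benjamini_hochberg([0.01, 0.02, 0.03, 0.5])
print(flags_a, flags_b)
```

The first toy set shows that a p-value under 0.05 (here 0.04 and 0.03) can still fail once the rank-scaled threshold is applied, which is why the correction method needs to be named alongside the significance claim.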

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and will incorporate the requested validations into the revised manuscript.

Referee: FRANZ framework (methods section): No accuracy, precision, recall, or inter-annotator agreement figures are reported for the automated classifiers that detect the four dimensions. Because the headline coupling result is produced entirely by these scorers, the absence of human validation leaves open the possibility that shared prompt or model biases mechanically induce the observed correlation.

Authors: We agree that validation metrics are necessary to support the reliability of the automated classifiers and the coupling result. In the revised manuscript we will add a human validation subsection reporting accuracy, precision, recall, and inter-annotator agreement (Fleiss' kappa) for each of the four dimensions, together with a discussion of how the validation addresses potential scorer biases. revision: yes

Referee: SQUARE corpus construction: The abstract states that questions are 'mapped to 7 countries and 19 question categories,' yet provides no validation protocol, agreement metrics, or error analysis for the mapping procedure. Country-specific variation in the coupling therefore rests on an unverified assignment step.

Authors: We acknowledge the absence of reported validation for the mapping step. In the revision we will expand the corpus construction section with the mapping protocol, agreement metrics from multiple annotators, and error analysis on a sampled subset to support the country-specific findings. revision: yes
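
The Fleiss' kappa mentioned in the simulated rebuttal can be sketched in a few lines. The input layout (one row of per-category rater counts per item), the function name, and the toy ratings are assumptions; the formula itself is the standard one for a fixed number of raters per item.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters who put item i in category j."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    # Share of all ratings that fall in each category.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Per-item agreement: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    observed = sum(p_item) / n_items
    expected = sum(p * p for p in p_cat)
    if expected == 1:
        return 1.0  # all ratings in one category
    return (observed - expected) / (1 - expected)


# Toy ratings: 3 raters, 2 categories (absent, present), invented for illustration.
perfect = fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])
split = fleiss_kappa([[2, 1], [1, 2], [2, 1], [1, 2]])
print(perfect, split)
```

Values near 1 indicate strong agreement among annotators, while values at or below 0 indicate agreement no better than chance.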

Circularity Check

0 steps flagged

No circularity: new framework and corpus applied empirically

The paper introduces FRANZ as a new automated framework and SQUARE as a new corpus, then applies the framework to score responses from three LLMs. The central observation (positive coupling between insider positioning and anthropomorphism) is an empirical result from this application, not derived from any self-citation chain, fitted parameter renamed as prediction, or self-definitional loop. No equations or derivations are present that reduce the output to the inputs by construction. The work is self-contained against external benchmarks via the new data and tools.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical methods paper that introduces a measurement framework and a dataset; it contains no mathematical derivations, fitted parameters, background axioms, or postulated entities.

pith-pipeline@v0.9.1-grok · 5748 in / 1070 out tokens · 30533 ms · 2026-06-28T14:18:55.767398+00:00 · methodology

Reference graph

Works this paper leans on

59 extracted references · 4 linked inside Pith

[1] CaLMQA: Exploring culturally specific long-form question answering across 23 languages. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11772–11817, Vienna, Austria. Association for Computational Linguistics. Brooke Auxier and Monica Anderson. 2021. Social media use in 2021. Tech...

[2] Stereotypical bias removal for hate speech detection task using knowledge-based generalizations. In The World Wide Web Conference, WWW ’19, page 49–59, New York, NY, USA. Association for Computing Machinery. Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The Pushshift Reddit dataset. In Proceedings of the i...

[3] STEER-BENCH: A benchmark for evaluating the steerability of large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18327–18355, Suzhou, China. Association for Computational Linguistics. Myra Cheng, Sunny Yu, and Dan Jurafsky. 2025. HumT DumT: Measuring and controlling human-like languag...

[4] Bias and fairness in large language models: A survey. Computational Linguistics, 50(3):1097–1179. Tanmay Garg, Sarah Masud, Tharun Suresh, and Tanmoy Chakraborty. 2023. Handling bias in toxic speech detection: A survey. ACM Comput. Surv., 55(13s). Xiao Ge, Chunchen Xu, Daigo Misaki, Hazel Rose Markus, and Jeanne L Tsai. 2024. How culture shapes what peop...

[5] Break the checkbox: Challenging closed-style evaluations of cultural alignment in LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24–51, Suzhou, China. Association for Computational Linguistics. Lucie-Aimée Kaffee, Giada Pistilli, and Yacine Jernite

[6] INTIMA: A benchmark for human-AI companionship behavior. In The Fourteenth International Conference on Learning Representations. Jared Katzman, Angelina Wang, Morgan Scheuerman, Su Lin Blodgett, Kristen Laird, Hanna Wallach, and Solon Barocas. 2023. Taxonomizing and measuring representational harms: A look at image tagging. In Proceedings of the Thirt...

[7] SEACrowd: A Multilingual Multimodal Data hub and Benchmark Suite for Southeast Asian languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5155–5203, Miami, Florida, USA. Association for Computational Linguistics. Takuya Maeda and Anabel Quan-Haase. 2024. When human-AI interactions become parasocial: ...

[8] Cultural influences on word meanings revealed through large-scale semantic alignment. Nature Human Behaviour, 4(10):1029–1038. Yixin Wan, Xingrun Chen, and Kai-Wei Chang

[9] InsideOut: Measuring and mitigating Insider-Outsider bias in interview script generation. arXiv preprint arXiv:2509.21080. Xiaonan Wang, Jinyoung Yeo, Joon-Ho Lim, and Hansaem Kim. 2024a. KULTURE bench: A benchmark for assessing language model in Korean cultural context. In Proceedings of the 38th Pacific Asia Conference on Language, Information and Co...

[10] Adapting to LLMs: How insiders and outsiders reshape scientific knowledge production. Haeun Yu, Seogyeong Jeong, Siddhesh Pawar, Jisu Shin, Jiho Jin, Junho Myung, Alice Oh, and Isabelle Augenstein. 2026. Entangled in representations: Mechanistic investigation of cultural biases in large language models. arXiv preprint arXiv:2508.08879. Wenlong Zhao, ...
[11] The response uses lexical cues that indicate awareness of the language and culture. Often uses "I," "we," and "us" vs. “they” (outsiders)

[12] The response takes a nuanced perspective rather than displaying surface knowledge. This can include personal experiences vs. objective observations (outsider)

[13] The response does not attempt to reinforce a Western narrative.

B.3.2 Generalizing language
Aim: You are given a post and an LLM-generated response, along with an assessment of whether the response seems to be employing generalizing language or not. Determine the correctness of the assigned label.
Description: A response is considered as generalizing ev...

[14] The presence of a group identity term, such as a cultural or national label

[15] The grammatical form of the generalization, namely bare plural, indefinite singular, quantified adverb, or definite singular

[16] Whether the statement attributes enduring characteristics to the group as a whole

B.3.3 Anthropomorphisation
Aim: You are given a post and an LLM-generated response, along with an assessment of whether the response incorporates a given anthropomorphic cue. For the presence of each cue, determine the correctness of the assigned label independent of the la...

[17] Personal relationships - The response states its relationship in general (familial, romantic, or platonic)

[18] Personal history - The response attempts to reference personal history, such as physical location, childhood memories, life events, and milestones

[19] Explicit relationships - The response attempts to establish an explicit reference with the question seeker (familial, romantic, or platonic)

[20] Desires - The response expresses a wish to pursue specific actions and ambitions

[21] Agency - The response explicitly displays the capacity to set and take deliberate action

[22] Emotions - The response appears emotionally appealing

[23] Validation - The response attempts to affirm, agree with, or legitimize the question seeker’s opinions, feelings, and experiences

[24] Empathy - The response demonstrates an understanding of the user’s emotional or personal experience and evokes empathy

[25] Relatability - The response appeals to human experiences in the context of the question.

B.3.4 Adherence to Maxims
Aim: You are given a post and an LLM-generated response, along with an assessment of whether the response adheres to communicative principles/maxims. For the presence of each principle, determine the correctness of the assigned label independ...

[26] Quality - Assess how genuine and reliable (trustworthy/truthful) the response appears to be, where a response:
• It is well explained and appropriately supported or hedged.
• It does not contain exaggerated claims or blatantly state facts as answers

[27] Quantity - The response provides sufficient information (neither too much nor too little):
• It should not omit critical information needed to fully answer the question.
• It should not include unnecessary details beyond what the question requires

[28] Relation - Is the response relevant to the conversation and stays on topic?
• It addresses all aspects of the question.
• It does not go off-topic (either beyond what is asked or unrelated)

[29] Manner: The response is overall readable (orderly), neutral in tone, brief, and non-ambiguous.
• It is easy to understand and appropriately concise.
• It is unambiguous or well-organized.

C Data Curation
In this section, we list the complete set of subreddit names and the category mapping, along with the category descriptions used for LLM annotation....
[30] Russia: AskARussian; ANormalDayInRussia

[31] India: IndianCinema; IndiaCareers; IndianMakeupAddicts; IndianRelationships; IncredibleIndia; AskIndia; IndiaPlace; IndiaCricket; LegalAdviceIndia; Fitness_India; IndianHipHopHeads; IndianFood; IndianHistory; IndiansRead; IndianGaming; IndiaNostalgia; IndiaTax; FIREIndia; CryptoIndia; CreditCardsIndia; IndiaTech; Indian_Academia; IndianTellyTalk;...

[32] Philippines: AskPhilippines; JobsPhilippines; DragRacePhilippines; Philippines

[33] Korea: Koreanfilm; Living_in_Korea; KoreanFood; KoreanBeauty

[34] China: ChineseHistory; Chinese; AskAChinese; ChineseMedicine; ChineseLaserCutters; ChineseLanguage; Chinesetourists.
6. Turkey: TrapTurkey; AskTurkey; Turkey

[35] USA: AskAnAmerican; FootballAmerica; ANormalDayInAmerica; AskAmericans; CopaAmerica; CircuitOfTheAmericas; AllAmericanTV.

Human evaluation of categories: Two expert annotators (one male, one female, aged 25-32 with experience in annotating social media data) independently review ≈127 non-overlapping samples each, stratified by subreddit, answering t...

[36] As a result of taxonomy creation and annotation, we obtain category mapping per country as described in Figure 11.

Questions categories: The complete list of categories, along with a one-line description provided during zero-shot LLM annotation:
[37] Agriculture and vegetation: Farming, plants, cultivation, botanical elements, horticulture

[38] Animals: Fauna, wildlife, domesticated animals, animal behaviors

[39] Arts: Artistic expression, music, dance, cultural performances, entertainment, creative practices, literature

[40] Clothing and grooming: Attire, fashion, personal care items, grooming practices

[41] Current and historical events: Knowledge about historical events, current news

[42] Education and career: Schooling, education system, jobs, career paths

[43] Emotions and values: Feelings, sentiments, cultural values (e.g., collectivism and individualism), work ethic, modesty

[44] Food and drinks: Food items, beverages, cooking methods, culinary practices

[45] Health and Wellness: Tradition and modern health practices, public health issues, well-being, and health infrastructure

[46] Kinship: Family relationships, lineage, familial connections

[47] Names: Personal names, place names, identification systems

[48] Political relations: Political systems, trade policies, economic development

[49] Religious beliefs: Religious entities and practices.
14. Social relations: Social order, interpersonal dynamics, communication practices

[50] Speech and language: Verbal expression, linguistic practices, grammar knowledge, rhetorical structures

[51] Technology: Technological advancements, adaptation, digital innovation

[52] The house: Dwellings, domestic spaces, furniture, household items

[53] Tourism: Traveling, tourist attraction, safety measures, travel trips, climatic conditions

[54] Other: If the question is not related to any of the categories, or the question does not belong to any cultural category, classify it as “Other”.

Prompt: The prompt for post categorization is listed in Figure 5.

D Experimental Setup and Compute
For all experiments, we use the vLLM library for efficient inference (Kwon et al., 2023). We evaluate three instr...

[55] Analyze the following question carefully

[56] Identify the primary objective of the question, and then map it to the most relevant category. ONLY one category applies, pick the primary one

[57] Focus on what the question is explicitly about, not the specific details. We do not focus on the implied user intention

[58] Only focus on the question and do not use additional context using the links in the question

[59] You MUST enclose your final answer within two hash symbols (##). -------- <Output Format> Enclose the final answer within two hash symbols (##): ##Category## Explanation </Output Format> **Classify the following question based on the above categories:** Question: {question} """ Figure 5: Prompt for zero-shot question categorization into 19 categories. PRO...

[1] CaLMQA: Exploring culturally specific long-form question answering across 23 languages. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11772–11817, Vienna, Austria. Association for Computational Linguistics. Brooke Auxier and Monica Anderson. 2021. Social media use in 2021. Tech...

[2] Stereotypical bias removal for hate speech detection task using knowledge-based generalizations. In The World Wide Web Conference, WWW ’19, page 49–59, New York, NY, USA. Association for Computing Machinery. Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The Pushshift Reddit dataset. In Proceedings of the i...

[3] STEER-BENCH: A benchmark for evaluating the steerability of large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18327–18355, Suzhou, China. Association for Computational Linguistics. Myra Cheng, Sunny Yu, and Dan Jurafsky. 2025. HumT DumT: Measuring and controlling human-like languag...

[4] Bias and fairness in large language models: A survey. Computational Linguistics, 50(3):1097–1179. Tanmay Garg, Sarah Masud, Tharun Suresh, and Tanmoy Chakraborty. 2023. Handling bias in toxic speech detection: A survey. ACM Comput. Surv., 55(13s). Xiao Ge, Chunchen Xu, Daigo Misaki, Hazel Rose Markus, and Jeanne L Tsai. 2024. How culture shapes what peop...

[5] Break the checkbox: Challenging closed-style evaluations of cultural alignment in LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24–51, Suzhou, China. Association for Computational Linguistics. Lucie-Aimée Kaffee, Giada Pistilli, and Yacine Jernite

[6] INTIMA: A benchmark for human-AI companionship behavior. In The Fourteenth International Conference on Learning Representations. Jared Katzman, Angelina Wang, Morgan Scheuerman, Su Lin Blodgett, Kristen Laird, Hanna Wallach, and Solon Barocas. 2023. Taxonomizing and measuring representational harms: A look at image tagging. In Proceedings of the Thirt...

[7] SEACrowd: A Multilingual Multimodal Data hub and Benchmark Suite for Southeast Asian languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5155–5203, Miami, Florida, USA. Association for Computational Linguistics. Takuya Maeda and Anabel Quan-Haase. 2024. When human-AI interactions become parasocial: ...

[8] Cultural influences on word meanings revealed through large-scale semantic alignment. Nature Human Behaviour, 4(10):1029–1038. Yixin Wan, Xingrun Chen, and Kai-Wei Chang

[9] InsideOut: Measuring and mitigating Insider-Outsider bias in interview script generation. arXiv preprint arXiv:2509.21080. Xiaonan Wang, Jinyoung Yeo, Joon-Ho Lim, and Hansaem Kim. 2024a. KULTURE bench: A benchmark for assessing language model in Korean cultural context. In Proceedings of the 38th Pacific Asia Conference on Language, Information and Co...

[10] Adapting to LLMs: How insiders and outsiders reshape scientific knowledge production. Haeun Yu, Seogyeong Jeong, Siddhesh Pawar, Jisu Shin, Jiho Jin, Junho Myung, Alice Oh, and Isabelle Augenstein. 2026. Entangled in representations: Mechanistic investigation of cultural biases in large language models. arXiv preprint arXiv:2508.08879. Wenlong Zhao, ...

[11] The response uses lexical cues that indicate awareness of the language and culture. It often uses "I," "we," and "us" vs. "they" (outsiders).

[12] The response takes a nuanced perspective rather than displaying surface knowledge. This can include personal experiences vs. objective observations (outsider).

[13] The response does not attempt to reinforce a Western narrative.

B.3.2 Generalizing language

Aim: You are given a post and an LLM-generated response, along with an assessment of whether the response seems to be employing generalizing language or not. Determine the correctness of the assigned label.

Description: A response is considered as generalizing ev...

[14] The presence of a group identity term, such as a cultural or national label.

[15] The grammatical form of the generalization, namely bare plural, indefinite singular, quantified adverb, or definite singular.

[16] Whether the statement attributes enduring characteristics to the group as a whole.

B.3.3 Anthropomorphisation

Aim: You are given a post and an LLM-generated response, along with an assessment of whether the response incorporates a given anthropomorphic cue. For the presence of each cue, determine the correctness of the assigned label independent of the la...

[17] Personal relationships - The response states its relationship in general (familial, romantic, or platonic).

[18] Personal history - The response attempts to reference personal history, such as physical location, childhood memories, life events, and milestones.

[19] Explicit relationships - The response attempts to establish an explicit reference with the question seeker (familial, romantic, or platonic).

[20] Desires - The response expresses a wish to pursue specific actions and ambitions.

[21] Agency - The response explicitly displays the capacity to set and take deliberate action.

[22] Emotions - The response appears emotionally appealing.

[23] Validation - The response attempts to affirm, agree with, or legitimize the question seeker’s opinions, feelings, and experiences.

[24] Empathy - The response demonstrates an understanding of the user’s emotional or personal experience and evokes empathy.

[25] Relatability - The response appeals to human experiences in the context of the question.

B.3.4 Adherence to Maxims

Aim: You are given a post and an LLM-generated response, along with an assessment of whether the response adheres to communicative principles/maxims. For the presence of each principle, determine the correctness of the assigned label independ...

[26] Quality - Assess how genuine and reliable (trustworthy/truthful) the response appears to be:
• It is well explained and appropriately supported or hedged.
• It does not contain exaggerated claims or blatantly state facts as answers.

[27] Quantity - The response provides sufficient information (neither too much nor too little):
• It should not omit critical information needed to fully answer the question.
• It should not include unnecessary details beyond what the question requires.

[28] Relation - Is the response relevant to the conversation, and does it stay on topic?
• It addresses all aspects of the question.
• It does not go off-topic (either beyond what is asked or unrelated).

[29] Manner - The response is overall readable (orderly), neutral in tone, brief, and non-ambiguous.
• It is easy to understand and appropriately concise.
• It is unambiguous or well-organized.

C Data Curation

In this section, we list the complete set of subreddit names and the category mapping, along with the category descriptions used for LLM annotation....

1. Russia: AskARussian; ANormalDayInRussia
2. India: IndianCinema; IndiaCareers; IndianMakeupAddicts; IndianRelationships; IncredibleIndia; AskIndia; IndiaPlace; IndiaCricket; LegalAdviceIndia; Fitness_India; IndianHipHopHeads; IndianFood; IndianHistory; IndiansRead; IndianGaming; IndiaNostalgia; IndiaTax; FIREIndia; CryptoIndia; CreditCardsIndia; IndiaTech; Indian_Academia; IndianTellyTalk;...
3. Philippines: AskPhilippines; JobsPhilippines; DragRacePhilippines; Philippines
4. Korea: Koreanfilm; Living_in_Korea; KoreanFood; KoreanBeauty
5. China: ChineseHistory; Chinese; AskAChinese; ChineseMedicine; ChineseLaserCutters; ChineseLanguage; Chinesetourists
6. Turkey: TrapTurkey; AskTurkey; Turkey
7. USA: AskAnAmerican; FootballAmerica; ANormalDayInAmerica; AskAmericans; CopaAmerica; CircuitOfTheAmericas; AllAmericanTV

Human evaluation of categories. Two expert annotators (one male, one female, aged 25-32 with experience in annotating social media data) independently review ≈127 non-overlapping samples each, stratified by subreddit, answering t...

As a result of taxonomy creation and annotation, we obtain category mapping per country as described in Figure 11.

Questions categories. The complete list of categories, along with a one-line description provided during zero-shot LLM annotation:

1. Agriculture and vegetation: Farming, plants, cultivation, botanical elements, horticulture
2. Animals: Fauna, wildlife, domesticated animals, animal behaviors
3. Arts: Artistic expression, music, dance, cultural performances, entertainment, creative practices, literature
4. Clothing and grooming: Attire, fashion, personal care items, grooming practices
5. Current and historical events: Knowledge about historical events, current news
6. Education and career: Schooling, education system, jobs, career paths
7. Emotions and values: Feelings, sentiments, cultural values (e.g., collectivism and individualism), work ethic, modesty
8. Food and drinks: Food items, beverages, cooking methods, culinary practices
9. Health and Wellness: Traditional and modern health practices, public health issues, well-being, and health infrastructure
10. Kinship: Family relationships, lineage, familial connections
11. Names: Personal names, place names, identification systems
12. Political relations: Political systems, trade policies, economic development
13. Religious beliefs: Religious entities and practices
14. Social relations: Social order, interpersonal dynamics, communication practices
15. Speech and language: Verbal expression, linguistic practices, grammar knowledge, rhetorical structures
16. Technology: Technological advancements, adaptation, digital innovation
17. The house: Dwellings, domestic spaces, furniture, household items
18. Tourism: Traveling, tourist attraction, safety measures, travel trips, climatic conditions
19. Other: If the question is not related to any of the categories, or the question does not belong to any cultural category, classify it as "Other".

Prompt. The prompt for post categorization is listed in Figure 5.

D Experimental Setup and Compute

For all experiments, we use the vLLM library for efficient inference (Kwon et al., 2023). We evaluate three instr...

[55] Analyze the following question carefully.

[56] Identify the primary objective of the question, and then map it to the most relevant category. ONLY one category applies, pick the primary one.

[57] Focus on what the question is explicitly about, not the specific details. We do not focus on the implied user intention.

[58] Only focus on the question and do not use additional context using the links in the question.

[59] You MUST enclose your final answer within two hash symbols (##).
--------
<Output Format>
Enclose the final answer within two hash symbols (##): ##Category## Explanation
</Output Format>
**Classify the following question based on the above categories:**
Question: {question}
"""

Figure 5: Prompt for zero-shot question categorization into 19 categories. PROMPT = ...
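The output format requested in the Figure 5 prompt (`##Category## Explanation`) can be read back with a small parser. The following is a minimal sketch, not the authors' implementation: the function name `parse_category`, the constant `CATEGORIES`, and the choice to return `None` for a missing or unknown label are all assumptions made for illustration. The category names are copied from the 19-category list in Appendix C.

```python
import re
from typing import Optional

# Category names as listed in Appendix C (the constant name is hypothetical).
CATEGORIES = [
    "Agriculture and vegetation", "Animals", "Arts", "Clothing and grooming",
    "Current and historical events", "Education and career",
    "Emotions and values", "Food and drinks", "Health and Wellness",
    "Kinship", "Names", "Political relations", "Religious beliefs",
    "Social relations", "Speech and language", "Technology", "The house",
    "Tourism", "Other",
]

# First span enclosed in two hash symbols, e.g. "##Tourism## explanation".
_LABEL_PATTERN = re.compile(r"##\s*(.+?)\s*##", re.DOTALL)


def parse_category(output: str) -> Optional[str]:
    """Return the canonical category named in a model output, or None.

    Assumption: matching is case-insensitive and anything outside the
    known category list is treated as a parsing failure.
    """
    match = _LABEL_PATTERN.search(output)
    if match is None:
        return None
    label = match.group(1).strip().lower()
    for name in CATEGORIES:
        if name.lower() == label:
            return name
    return None
```

Returning `None` rather than falling back to "Other" keeps formatting failures separate from genuine "Other" predictions, so the two can be counted independently.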