
arxiv: 2606.02493 · v2 · pith:ZAIRVYC2new · submitted 2026-06-01 · 💻 cs.CL

Not What, But How: A Framework for Auditing LLM Responses across Positioning, Generalization, Anthropomorphism, and Maxims

Pith reviewed 2026-06-28 14:18 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM auditing · response framing · cultural positioning · anthropomorphism · conversational maxims · generalization · subjective questions · communicative audit

The pith

FRANZ audits LLM responses to subjective questions on four framing dimensions and finds insider positioning coupled with anthropomorphism at rates that vary by country.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FRANZ as an automated way to examine not only what LLMs say in response to cultural or opinion questions but how they say it. It measures four aspects of communication: whether the model speaks from an insider perspective, uses sweeping generalizations, employs human-like cues, and follows basic conversational rules. The work applies this to hundreds of thousands of real user questions drawn from different countries, scoring outputs from several models. Results show clear differences across models and, more notably, a consistent positive link between insider positioning and anthropomorphism whose strength changes with the country of the question source.

Core claim

FRANZ is an automated framework that scores LLM responses along cultural positioning, use of generalizing language, anthropomorphic cues, and adherence to conversational maxims. When applied to outputs from three open-weight models on the SQUARE corpus of 376k questions from 57 subreddits mapped to seven countries and nineteen categories, the framework detects statistically significant differences in how often each characteristic appears and identifies a positive coupling between insider positioning and anthropomorphism whose strength varies by country.

What carries the argument

FRANZ, the automated scoring framework that quantifies the four communicative dimensions on responses to the SQUARE question corpus.

If this is right

  • Models differ reliably in how frequently they adopt each of the four response characteristics.
  • Insider positioning and anthropomorphism appear together more often than expected by chance.
  • The strength of the positioning-anthropomorphism link is not uniform but depends on the country associated with the question.
  • Multi-dimensional auditing can surface framing patterns that single-metric checks miss.
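
The coupling described in the bullets above can be made concrete with a small calculation. The sketch below is an illustration only: it assumes each response carries binary labels (1 = characteristic present, 0 = absent) and computes the raw difference between two conditional probabilities. The function name `cooccurrence_delta` is hypothetical, and this is not the paper's Eq. 1 or its GEE model, neither of which is reproduced on this page; no significance test is included.

```python
from typing import Iterable, List, Tuple


def cooccurrence_delta(pairs: Iterable[Tuple[int, int]]) -> float:
    """Return P(effect=1 | cause=1) - P(effect=1 | cause=0).

    Each pair is (cause, effect), e.g. (insider positioning, empathy),
    with 1/0 marking presence/absence. Hypothetical helper, not the
    paper's estimator.
    """
    data: List[Tuple[int, int]] = list(pairs)
    with_cause = [effect for cause, effect in data if cause == 1]
    without_cause = [effect for cause, effect in data if cause == 0]
    if not with_cause or not without_cause:
        raise ValueError("need at least one response with and without the cause")
    p_with = sum(with_cause) / len(with_cause)
    p_without = sum(without_cause) / len(without_cause)
    return p_with - p_without


# Toy labels (made up for illustration): 3 responses with the cause, 4 without.
toy_pairs = [(1, 1), (1, 1), (1, 0), (0, 1), (0, 0), (0, 0), (0, 0)]
toy_delta = cooccurrence_delta(toy_pairs)  # 2/3 - 1/4
```

A positive value means the effect is more frequent when the cause is present. Comparing this value across country subsets is one simple way to see whether the link is uniform.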

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The country-specific coupling pattern could be used to flag when a model’s framing diverges from expected local norms.
  • Extending the same audit to additional models or languages would test whether the observed coupling is architecture-dependent or data-dependent.
  • Prompt interventions that reduce one characteristic might unintentionally affect the other because of their measured link.

Load-bearing premise

The automated classifiers inside FRANZ correctly detect and measure the four target response characteristics without large measurement error.

What would settle it

A side-by-side human annotation of several thousand scored responses that shows low agreement with FRANZ labels on the presence or intensity of insider positioning or anthropomorphism.
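
One way to quantify "low agreement" in such a comparison is Cohen's kappa, which corrects raw agreement for chance. The sketch below is a minimal pure-Python version under the assumption that human and FRANZ labels are aligned lists of category labels for the same responses. The function name and the toy labels are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two aligned label sequences."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(rater_a)
    # Observed agreement: share of items where both raters give the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the two raters labelled independently.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example (made-up labels): 1 = insider positioning present, 0 = absent.
human_labels = [1, 1, 0, 0, 1, 0, 1, 0]
franz_labels = [1, 1, 0, 0, 1, 0, 0, 1]
toy_kappa = cohens_kappa(human_labels, franz_labels)
```

Values near 0 indicate agreement no better than chance, which is the kind of outcome that would undermine the load-bearing premise.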

Figures

Figures reproduced from arXiv: 2606.02493 by Alice Oh, Haneul Yoo, Isabelle Augenstein, Sarah Masud, Siddhesh Milind Pawar.

Figure 1: Overview of the pipeline. LLMs generate responses to subjective questions in SQUARE (a–c; Section 4); FRANZ then scores each response along four characteristics, viz. cultural positioning, generalizing language, anthropomorphism, and maxim adherence (d–e; Section 3).
Adjacent body text: …culturally grounded question categories. We employ FRANZ to evaluate responses for SQUARE generated by Llama-3.1-8B-it (Llama), Gemma-3-12b-it (…
Figure 2: Percentage of responses by 3 LLMs on SQUARE judged by FRANZ to exhibit (a) insider positioning, (b) generalizing language, and (c) different anthropomorphic cues (Section 6.1).
Adjacent body text: …where 1/0 denote the presence/absence of a response characteristic. We fit one GEE per country-category subset of the cause-effect pair. RQ3 reuses the statistical techniques from RQ1 and RQ2. For all RQs, …
Figure 3: Computed via Eq. 1, each cell reports the number of combinations across top categories and LLMs per country where the presence of a cause significantly increases the probability of effect (blue), decreases it (red), or has no impact (∆ ≈ 0, grey). Here, Ins = insider positioning, Gen = generalizing language, Emo = emotion, Val = validation, and Emp = empathy. Countries are listed by total count across 19 categories …
Figure 4: Percentage of violation (not adherence) per maxim, aggregated by country for …
Figure 5: Prompt for zero-shot question categorization into 19 categories.
Figure 6: Prompt for the insider/outsider positioning judge.
Figure 7: Prompt template for the anthropomorphism judge. The scaffold is constant; …
Figure 8: Prompt for the generalizing-language judge.
Figure 9: Shared prompt template for the four maxim judges. Only …
Figure 10: Overall, 3 of 11 CLIcK (Kim et al., 2024) categories, 1 of 12 CaLMQA (Arora et al., 2025) categories, and 10 of 21 Thompson (Thompson et al., 2020) categories do not map to our use case.
Figure 11: Percentage of question category in SQUARE mapped to respective subreddit countries.
[Heatmap residue removed. Recoverable labels: rows India (9), China (9), Philippines (9), USA (9), Korea (9), Russia (9), Turkey (3); columns pair each maxim (Qual, Quant, Rel, Man) with Ins, Gen, Emo, …]
Figure 12: C→E co-occurrence (Eq. 1). Each cell reports the number of combinations across top categories and LLMs (3 × 3) per country where the presence of a cause significantly increases the probability of effect (blue), decreases it (red), or has no impact (∆ ≈ 0, grey). Here, Ins = insider positioning, Gen = generalizing language. Anthropomorphic cues include Emo = emotion, Val = validation, and Emp = empathy. The maxims are …
Figure 13: India: Computed via Eq. 1, the estimated co-occurrence effects of (a) cultural positioning on anthropomorphism, (b) anthropomorphism on cultural positioning, and (c) maxim of manner on anthropomorphism. Each panel shows the change in probability across models for the top-3 categories. Solid lines indicate significant Wald effects; † denotes FDR-corrected cross-model differences, both at p < 0.05 (Sections 6.2 …
Figure 14: China: Computed via Eq. 1, the estimated co-occurrence effects of (a) cultural positioning on anthropomorphism, (b) anthropomorphism on cultural positioning, and (c) maxim of manner on anthropomorphism. Each panel shows the change in probability (∆) across models for the top-3 categories. Solid lines indicate significant Wald effects; † denotes FDR-corrected cross-model differences, both at p < 0.05 (Sections 6…
Figure 15: Philippines: Computed via Eq. 1, the estimated co-occurrence effects of (a) cultural positioning on anthropomorphism, (b) anthropomorphism on cultural positioning, and (c) maxim of manner on anthropomorphism. Each panel shows the change in probability (∆) across models for the top-3 categories. Solid lines indicate significant Wald effects; † denotes FDR-corrected cross-model differences, both at p < 0.05 (Sect…
Figure 16: USA: Computed via Eq. 1, the estimated co-occurrence effects of (a) cultural positioning on anthropomorphism, (b) anthropomorphism on cultural positioning, and (c) maxim of manner on anthropomorphism. Each panel shows the change in probability (∆) across models for the top-3 categories. Solid lines indicate significant Wald effects; † denotes FDR-corrected cross-model differences, both at p < 0.05 (Sections 6.…
Figure 17: Korea: Computed via Eq. 1, the estimated co-occurrence effects of (a) cultural positioning on anthropomorphism, (b) anthropomorphism on cultural positioning, and (c) maxim of manner on anthropomorphism. Each panel shows the change in probability (∆) across models for the top-3 categories. Solid lines indicate significant Wald effects; † denotes FDR-corrected cross-model differences, both at p < 0.05 (Sections 6…
Figure 18: Russia: Computed via Eq. 1, the estimated co-occurrence effects of (a) cultural positioning on anthropomorphism, (b) anthropomorphism on cultural positioning, and (c) maxim of manner on anthropomorphism. Each panel shows the change in probability (∆) across models for the top-3 categories. Solid lines indicate significant Wald effects; † denotes FDR-corrected cross-model differences, both at p < 0.05 (Sections …
Figure 19: Turkey: Computed via Eq. 1, the estimated co-occurrence effects of (a) cultural positioning on anthropomorphism, (b) anthropomorphism on cultural positioning, and (c) maxim of manner on anthropomorphism. Each panel shows the change in probability (∆) across models for the top-3 categories. Solid lines indicate significant Wald effects; † denotes FDR-corrected cross-model differences, both at p < 0.05 (Sections …
Original abstract

Large language models (LLMs) are being increasingly used to answer subjective, information-seeking questions, where users are sensitive to how responses are communicated, not just whether the answers are correct. Existing LLM evaluations for subjective cultural queries largely focus on factual correctness, ignoring how the response is framed. To this end, we introduce FRANZ, an automated FRAmework for respoNse characteriZation to conduct communicative audit of LLM responses along four dimensions: cultural positioning, use of generalizing language, anthropomorphic cues, and adherence to conversational maxims. To enable this evaluation, we contribute SQUARE - a corpus of 376k subjective questions sourced from 57 subreddits, and mapped to 7 countries and 19 question categories. We demonstrate FRANZ's applicability by scoring responses from three open-weight LLMs. We observe that LLMs show statistically significant differences in the frequency with which they employ each response characteristic. Unlike single-dimensional audits, FRANZ reveals that insider positioning and anthropomorphism are positively coupled, with the degree of coupling varying by country, providing a diagnostic lens for identifying framing divergences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces FRANZ, an automated framework for auditing LLM responses to subjective questions along four dimensions (cultural positioning, generalization, anthropomorphism, and conversational maxims). It contributes the SQUARE corpus of 376k questions from 57 subreddits mapped to 7 countries and 19 categories. Applying FRANZ to three open-weight LLMs yields statistically significant differences in characteristic frequencies and reveals a positive coupling between insider positioning and anthropomorphism whose strength varies by country.

Significance. If the automated measurements prove reliable, FRANZ supplies a multi-dimensional diagnostic for LLM framing that single-axis audits miss, and the SQUARE corpus offers a reusable resource for culturally grounded evaluation. The reported coupling provides a concrete, falsifiable pattern that could guide alignment work on subjective queries.

major comments (2)
  1. [FRANZ framework] FRANZ framework (methods section): No accuracy, precision, recall, or inter-annotator agreement figures are reported for the automated classifiers that detect the four dimensions. Because the headline coupling result is produced entirely by these scorers, the absence of human validation leaves open the possibility that shared prompt or model biases mechanically induce the observed correlation.
  2. [SQUARE corpus] SQUARE corpus construction: The abstract states that questions are 'mapped to 7 countries and 19 question categories,' yet provides no validation protocol, agreement metrics, or error analysis for the mapping procedure. Country-specific variation in the coupling therefore rests on an unverified assignment step.
minor comments (1)
  1. The abstract asserts 'statistically significant differences' without naming the tests, correction method, or effect-size thresholds employed.
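
To illustrate what naming the correction method would involve: the figure captions on this page mention FDR-corrected comparisons at p < 0.05 but do not say which procedure was used. The sketch below shows the Benjamini-Hochberg step-up procedure as one common choice. Treating it as the paper's method is an assumption, and the function name and p-values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return one reject/keep decision per hypothesis, in input order.

    Step-up rule: find the largest rank k (1-based, p-values sorted
    ascending) with p_(k) <= (k / m) * alpha, then reject ranks 1..k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_passing_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            largest_passing_rank = rank
    reject = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= largest_passing_rank:
            reject[index] = True
    return reject


# Made-up p-values for eight hypothetical cross-model comparisons.
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
toy_decisions = benjamini_hochberg(toy_p)
```

Because the rule is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value passes at a higher rank.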

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and will incorporate the requested validations into the revised manuscript.

Point-by-point responses
  1. Referee: [FRANZ framework] FRANZ framework (methods section): No accuracy, precision, recall, or inter-annotator agreement figures are reported for the automated classifiers that detect the four dimensions. Because the headline coupling result is produced entirely by these scorers, the absence of human validation leaves open the possibility that shared prompt or model biases mechanically induce the observed correlation.

    Authors: We agree that validation metrics are necessary to support the reliability of the automated classifiers and the coupling result. In the revised manuscript we will add a human validation subsection reporting accuracy, precision, recall, and inter-annotator agreement (Fleiss' kappa) for each of the four dimensions, together with a discussion of how the validation addresses potential scorer biases. revision: yes

  2. Referee: [SQUARE corpus] SQUARE corpus construction: The abstract states that questions are 'mapped to 7 countries and 19 question categories,' yet provides no validation protocol, agreement metrics, or error analysis for the mapping procedure. Country-specific variation in the coupling therefore rests on an unverified assignment step.

    Authors: We acknowledge the absence of reported validation for the mapping step. In the revision we will expand the corpus construction section with the mapping protocol, agreement metrics from multiple annotators, and error analysis on a sampled subset to support the country-specific findings. revision: yes
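
The validation figures promised in the responses above (accuracy, precision, recall, and Fleiss' kappa) can be computed with a few lines of code. The sketch below is a minimal pure-Python illustration, assuming binary gold/predicted labels for the classifier metrics and, for Fleiss' kappa, a table where `ratings[i][j]` counts the annotators who assigned item `i` to category `j`. The function names and toy numbers are hypothetical.

```python
from typing import Dict, Sequence


def binary_metrics(gold: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, and recall for aligned 0/1 labels."""
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    tn = sum(g == 0 and p == 0 for g, p in zip(gold, pred))
    return {
        "accuracy": (tp + tn) / len(gold),
        # Report 0.0 when a denominator is empty instead of dividing by zero.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of per-item category counts."""
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("need at least one rated item")
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    # Mean per-item agreement among rater pairs.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # Chance agreement from overall category proportions.
    n_categories = len(ratings[0])
    totals = [sum(row[j] for row in ratings) for j in range(n_categories)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    if p_e == 1.0:
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)


# Toy data (made up): five responses, then four items rated by three annotators.
toy_metrics = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
toy_fleiss = fleiss_kappa([[2, 1], [1, 2], [2, 1], [1, 2]])
```

Reporting these per dimension, rather than pooled, would show whether any single judge (for example, the anthropomorphism judge) drives the headline coupling.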

Circularity Check

0 steps flagged

No circularity: new framework and corpus applied empirically

Full rationale

The paper introduces FRANZ as a new automated framework and SQUARE as a new corpus, then applies the framework to score responses from three LLMs. The central observation (positive coupling between insider positioning and anthropomorphism) is an empirical result from this application, not derived from any self-citation chain, fitted parameter renamed as prediction, or self-definitional loop. No equations or derivations are present that reduce the output to the inputs by construction. The work is self-contained against external benchmarks via the new data and tools.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical methods paper that introduces a measurement framework and a dataset; it contains no mathematical derivations, fitted parameters, background axioms, or postulated entities.



Reference graph

Works this paper leans on

59 extracted references · 4 linked inside Pith

  1. [1]

    CaLMQA: Exploring culturally specific long-form question answering across 23 languages. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11772–11817, Vienna, Austria. Association for Computational Linguistics. Brooke Auxier and Monica Anderson. 2021. Social media use in 2021. Tech...

  2. [2]

    Stereotypical bias removal for hate speech detection task using knowledge-based generalizations. In The World Wide Web Conference, WWW ’19, page 49–59, New York, NY, USA. Association for Computing Machinery. Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The Pushshift Reddit dataset. In Proceedings of the i...

  3. [3]

    STEER-BENCH: A benchmark for evaluating the steerability of large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18327–18355, Suzhou, China. Association for Computational Linguistics. Myra Cheng, Sunny Yu, and Dan Jurafsky. 2025. HumT DumT: Measuring and controlling human-like languag...

  4. [4]

    Bias and fairness in large language models: A survey. Computational Linguistics, 50(3):1097–1179. Tanmay Garg, Sarah Masud, Tharun Suresh, and Tanmoy Chakraborty. 2023. Handling bias in toxic speech detection: A survey. ACM Comput. Surv., 55(13s). Xiao Ge, Chunchen Xu, Daigo Misaki, Hazel Rose Markus, and Jeanne L Tsai. 2024. How culture shapes what peop...

  5. [5]

    Break the checkbox: Challenging closed-style evaluations of cultural alignment in LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24–51, Suzhou, China. Association for Computational Linguistics. Lucie-Aimée Kaffee, Giada Pistilli, and Yacine Jernite

  6. [6]

    INTIMA: A benchmark for human-AI companionship behavior. In The Fourteenth International Conference on Learning Representations. Jared Katzman, Angelina Wang, Morgan Scheuerman, Su Lin Blodgett, Kristen Laird, Hanna Wallach, and Solon Barocas. 2023. Taxonomizing and measuring representational harms: A look at image tagging. In Proceedings of the Thirt...

  7. [7]

    SEACrowd: A Multilingual Multimodal Data hub and Benchmark Suite for Southeast Asian languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5155–5203, Miami, Florida, USA. Association for Computational Linguistics. Takuya Maeda and Anabel Quan-Haase. 2024. When human-AI interactions become parasocial: ...

  8. [8]

    Cultural influences on word meanings revealed through large-scale semantic alignment. Nature Human Behaviour, 4(10):1029–1038. Yixin Wan, Xingrun Chen, and Kai-Wei Chang

  9. [9]

    InsideOut: Measuring and mitigating Insider-Outsider bias in interview script generation. arXiv preprint arXiv:2509.21080. Xiaonan Wang, Jinyoung Yeo, Joon-Ho Lim, and Hansaem Kim. 2024a. KULTURE bench: A benchmark for assessing language model in Korean cultural context. In Proceedings of the 38th Pacific Asia Conference on Language, Information and Co...

  10. [10]

    Adapting to LLMs: How insiders and outsiders reshape scientific knowledge production. Haeun Yu, Seogyeong Jeong, Siddhesh Pawar, Jisu Shin, Jiho Jin, Junho Myung, Alice Oh, and Isabelle Augenstein. 2026. Entangled in representations: Mechanistic investigation of cultural biases in large language models. arXiv preprint arXiv:2508.08879. Wenlong Zhao, ...

  11. [11]

    The response uses lexical cues that indicate awareness of the language and culture. Often uses "I," "we," and "us" vs. "they" (outsiders)

  12. [12]

    The response takes a nuanced perspective rather than displaying surface knowledge. This can include personal experiences vs. objective observations (outsider)

  13. [13]

    The response does not attempt to reinforce a Western narrative.

    B.3.2 Generalizing language
    Aim: You are given a post and an LLM-generated response, along with an assessment of whether the response seems to be employing generalizing language or not. Determine the correctness of the assigned label.
    Description: A response is considered as generalizing ev...

  14. [14]

    The presence of a group identity term, such as a cultural or national label

  15. [15]

    The grammatical form of the generalization, namely bare plural, indefinite singular, quantified adverb, or definite singular

  16. [16]

    Whether the statement attributes enduring characteristics to the group as a whole

    B.3.3 Anthropomorphisation
    Aim: You are given a post and an LLM-generated response, along with an assessment of whether the response incorporates a given anthropomorphic cue. For the presence of each cue, determine the correctness of the assigned label independent of the la...

  17. [17]

    Personal relationships - The response states its relationship in general (familial, romantic, or platonic)

  18. [18]

    Personal history - The response attempts to reference personal history, such as physical location, childhood memories, life events, and milestones

  19. [19]

    Explicit relationships - The response attempts to establish an explicit reference with the question seeker (familial, romantic, or platonic)

  20. [20]

    Desires - The response expresses a wish to pursue specific actions and ambitions

  21. [21]

    Agency - The response explicitly displays the capacity to set and take deliberate action

  22. [22]

    Emotions - The response appears emotionally appealing

  23. [23]

    Validation - The response attempts to affirm, agree with, or legitimize the question seeker’s opinions, feelings, and experiences

  24. [24]

    Empathy - The response demonstrates an understanding of the user’s emotional or personal experience and evokes empathy

  25. [25]

    Relatability - The response appeals to human experiences in the context of the question.

    B.3.4 Adherence to Maxims
    Aim: You are given a post and an LLM-generated response, along with an assessment of whether the response adheres to communicative principles/maxims. For the presence of each principle, determine the correctness of the assigned label independ...

  26. [26]

    Quality - Assess how genuine and reliable (trustworthy/truthful) the response appears to be, where a response:
    • It is well explained and appropriately supported or hedged.
    • It does not contain exaggerated claims or blatantly state facts as answers

  27. [27]

    Quantity - The response provides sufficient information (neither too much nor too little):
    • It should not omit critical information needed to fully answer the question.
    • It should not include unnecessary details beyond what the question requires

  28. [28]

    Relation - Is the response relevant to the conversation and stays on topic?
    • It addresses all aspects of the question.
    • It does not go off-topic (either beyond what is asked or unrelated)

  29. [29]

    Manner: The response is overall readable (orderly), neutral in tone, brief, and non-ambiguous.
    • It is easy to understand and appropriately concise.
    • It is unambiguous or well-organized.

    C Data Curation
    In this section, we list the complete set of subreddit names and the category mapping, along with the category descriptions used for LLM annotation....

  30. [30]

    Russia: AskARussian; ANormalDayInRussia

  31. [31]

    India: IndianCinema; IndiaCareers; IndianMakeupAddicts; IndianRelationships; IncredibleIndia; AskIndia; IndiaPlace; IndiaCricket; LegalAdviceIndia; Fitness_India; IndianHipHopHeads; IndianFood; IndianHistory; IndiansRead; IndianGaming; IndiaNostalgia; IndiaTax; FIREIndia; CryptoIndia; CreditCardsIndia; IndiaTech; Indian_Academia; IndianTellyTalk;...

  32. [32]

    Philippines: AskPhilippines; JobsPhilippines; DragRacePhilippines; Philippines

  33. [33]

    Korea: Koreanfilm; Living_in_Korea; KoreanFood; KoreanBeauty

  34. [34]

    China: ChineseHistory; Chinese; AskAChinese; ChineseMedicine; ChineseLaserCutters; ChineseLanguage; Chinesetourists.

    6. Turkey: TrapTurkey; AskTurkey; Turkey

  35. [35]

    USA: AskAnAmerican; FootballAmerica; ANormalDayInAmerica; AskAmericans; CopaAmerica; CircuitOfTheAmericas; AllAmericanTV.

    Human evaluation of categories: Two expert annotators (one male, one female, aged 25-32 with experience in annotating social media data) independently review ≈127 non-overlapping samples each, stratified by subreddit, answering t...

  36. [36]

    As a result of taxonomy creation and annotation, we obtain category mapping per country as described in Figure 11.

    Questions categories: The complete list of categories, along with a one-line description provided during zero-shot LLM annotation:

  37. [37]

    Agriculture and vegetation: Farming, plants, cultivation, botanical elements, horticulture

  38. [38]

    Animals: Fauna, wildlife, domesticated animals, animal behaviors

  39. [39]

    Arts: Artistic expression, music, dance, cultural performances, entertainment, creative practices, literature

  40. [40]

    Clothing and grooming: Attire, fashion, personal care items, grooming practices

  41. [41]

    Current and historical events: Knowledge about historical events, current news

  42. [42]

    Education and career: Schooling, education system, jobs, career paths

  43. [43]

    Emotions and values: Feelings, sentiments, cultural values (e.g., collectivism and individualism), work ethic, modesty

  44. [44]

    Food and drinks: Food items, beverages, cooking methods, culinary practices

  45. [45]

    Health and Wellness: Traditional and modern health practices, public health issues, well-being, and health infrastructure

  46. [46]

    Kinship: Family relationships, lineage, familial connections

  47. [47]

    Names: Personal names, place names, identification systems

  48. [48]

    Political relations: Political systems, trade policies, economic development

  49. [49]

    Religious beliefs: Religious entities and practices.

    14. Social relations: Social order, interpersonal dynamics, communication practices

  50. [50]

    Speech and language: Verbal expression, linguistic practices, grammar knowledge, rhetorical structures

  51. [51]

    Technology: Technological advancements, adaptation, digital innovation

  52. [52]

    The house: Dwellings, domestic spaces, furniture, household items

  53. [53]

    Tourism: Traveling, tourist attraction, safety measures, travel trips, climatic conditions

  54. [54]

    Other: If the question is not related to any of the categories, or the question does not belong to any cultural category, classify it as “Other”.

    Prompt: The prompt for post categorization is listed in Figure 5.

    D Experimental Setup and Compute
    For all experiments, we use the vLLM library for efficient inference (Kwon et al., 2023). We evaluate three instr...

  55. [55]

    Analyze the following question carefully

  56. [56]

    Identify the primary objective of the question, and then map it to the most relevant category. ONLY one category applies, pick the primary one

  57. [57]

    Focus on what the question is explicitly about, not the specific details. We do not focus on the implied user intention

  58. [58]

    Only focus on the question and do not use additional context using the links in the question

  59. [59]

    You MUST enclose your final answer within two hash symbols (##).
    --------
    <Output Format>
    Enclose the final answer within two hash symbols (##): ##Category## Explanation
    </Output Format>
    **Classify the following question based on the above categories:**
    Question: {question}
    """
    Figure 5: Prompt for zero-shot question categorization into 19 categories. PRO...
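
The output convention in the categorization prompt above (final answer enclosed in `##`, followed by an explanation) implies a parsing step on the model output. The sketch below is one hedged way to do it. The function name `extract_category`, the case-insensitive matching, and the choice to return `None` for unknown labels are all assumptions, not details from the paper.

```python
import re
from typing import Iterable, Optional

# Matches the first "##label##" span, tolerating spaces inside the markers.
_ANSWER_PATTERN = re.compile(r"##\s*(.+?)\s*##", re.DOTALL)


def extract_category(model_output: str, allowed: Iterable[str]) -> Optional[str]:
    """Return the canonical category named between ## markers, or None.

    Hypothetical helper: None is returned when the markers are missing or
    the label is not one of the allowed categories.
    """
    match = _ANSWER_PATTERN.search(model_output)
    if match is None:
        return None
    label = match.group(1).strip().lower()
    canonical = {name.lower(): name for name in allowed}
    return canonical.get(label)


# A few category names taken from the list above; the outputs are made up.
categories = ["Food and drinks", "Tourism", "Other"]
parsed = extract_category("##Food and drinks## The question asks about recipes.", categories)
```

Returning `None` instead of guessing keeps malformed outputs visible, so they can be counted or re-queried rather than silently assigned to a category.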