Having Beer after Prayer? Measuring Cultural Bias in Large Language Models
10 Pith papers cite this work.
Citing papers by year: 2026 (10)
Citing papers
- Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity
  The VOIR DIRE benchmark shows that MLLM-as-a-Judge failures decompose into a positivity-floor calibration failure and an orientation failure on culturally contested items, with persona prompting recovering only the former.
- CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs
  The CultureForest benchmark shows that top LLMs degrade sharply on open-ended cultural reasoning tasks, exhibit regional disparities, and are limited more by ineffective use of knowledge than by a lack of knowledge.
- Same question, different history: language, national identity, and credit in large language models
  An analysis of 11 LLMs on 21 disputed inventions across 12 languages and 75,896 responses finds that query language systematically shifts credit toward lower-status claimants in their associated language, while credit to Anglophone figures remains stable.
- KG-FairDiff: Knowledge Graph-Guided Prompt Refinement for Demographically Fair Text-to-Image Generation
  KG-FairDiff is an inference-time framework that uses a knowledge graph to guide prompt refinement and reduce gender, race, age, and intersectional biases in text-to-image generation while preserving semantics.
- Which Institutional Frameworks Do Chatbots Assume? Auditing Jurisdictional Defaults in Multilingual LLMs
  LLMs default to U.S. frameworks for English prompts and Chinese frameworks for Chinese prompts on jurisdiction-underspecified legal-administrative queries, with the pattern holding across all seven tested models.
- JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors
  JuICE is a new multilingual benchmark dataset showing that top LLM judges reach an F1 of only 0.52 on span-level cultural error detection and miss errors that locals readily spot.
- Omissive Bias in Religious Representation: Benchmarking LLM Answers to Everyday Ethical Decision-making
  LLMs show omissive bias: they underrepresent religious frameworks in responses to non-religious ethical questions relative to human expectations.
- Defining Cultural Capabilities for AI Evaluation: A Taxonomy Grounded in Intercultural Communication Theory
  Proposes a three-level taxonomy of Cultural Awareness, Cultural Sensitivity, and Cultural Competence for AI evaluation, grounded in intercultural communication scholarship, to improve validity in multicultural contexts.
- Steerable Cultural Preference Optimization of Reward Models
  SCPO is a steerable training method for reward models that improves minority cultural preference accuracy by up to 7 points and is up to 280% more data-efficient than standard finetuning on the PRISM and GlobalOpinionQA datasets.
- Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency
  A multilingual self-consistency plus self-critique method raises cultural alignment scores on English queries by 5.03% on the BLEnD benchmark using only self-generated data.