Having Beer after Prayer? Measuring Cultural Bias in Large Language Models

Tarek Naous, Michael J Ryan, Alan Ritter, Wei Xu · 2024 · DOI 10.18653/v1/2024.acl-long.862

10 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity

cs.CV · 2026-06-12 · unverdicted · novelty 7.0

VOIR DIRE benchmark shows MLLM-as-a-Judge systems decompose into positivity-floor calibration failure and orientation failure on culturally contested items, with persona prompting recovering only the former.

CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

CultureForest benchmark shows top LLMs degrade sharply on open-ended cultural reasoning tasks, exhibit regional disparities, and are limited more by effective use of knowledge than by lack of knowledge itself.

Same question, different history: language, national identity, and credit in large language models

cs.CL · 2026-06-22 · unverdicted · novelty 6.0

Analysis of 11 LLMs on 21 disputed inventions across 12 languages and 75,896 responses finds query language systematically shifts credit toward lower-status claimants in their associated language while Anglophone figures remain stable.

KG-FairDiff: Knowledge Graph-Guided Prompt Refinement for Demographically Fair Text-to-Image Generation

cs.CV · 2026-05-31 · unverdicted · novelty 6.0

KG-FairDiff is an inference-time framework that uses a knowledge graph to guide prompt refinement and reduce gender, race, age, and intersectional biases in text-to-image generation while preserving semantics.

Which Institutional Frameworks Do Chatbots Assume? Auditing Jurisdictional Defaults in Multilingual LLMs

cs.CL · 2026-05-29 · conditional · novelty 6.0

LLMs default to U.S. frameworks for English prompts and China frameworks for Chinese prompts on jurisdiction-underspecified legal-administrative queries, with the pattern holding across all seven tested models.

JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors

cs.CL · 2026-05-26 · unverdicted · novelty 6.0

JuICE is a new multilingual benchmark dataset showing top LLM judges reach only F1 0.52 on span-level cultural error detection and miss errors locals readily spot.

Omissive Bias in Religious Representation: Benchmarking LLM Answers to Everyday Ethical Decision-making

cs.LG · 2026-05-23 · unverdicted · novelty 6.0

LLMs show omissive bias by underrepresenting religious frameworks in responses to non-religious ethical questions relative to human expectations.

Defining Cultural Capabilities for AI Evaluation: A Taxonomy Grounded in Intercultural Communication Theory

cs.CL · 2026-05-15 · unverdicted · novelty 6.0

Proposes a three-level taxonomy of Cultural Awareness, Cultural Sensitivity, and Cultural Competence for AI evaluation, grounded in intercultural communication scholarship to improve validity in multicultural contexts.

Steerable Cultural Preference Optimization of Reward Models

cs.CL · 2026-06-17 · unverdicted · novelty 5.0

SCPO is a steerable training method for reward models that improves minority cultural preference accuracy by up to 7 points and is up to 280% more data-efficient than standard finetuning on PRISM and GlobalOpinionQA datasets.

Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency

cs.CL · 2026-05-21 · unverdicted · novelty 5.0 · 2 refs

A multilingual self-consistency plus self-critique method raises cultural alignment scores on English queries by 5.03% on the BLEnD benchmark using only self-generated data.

citing papers explorer

Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity
cs.CV · 2026-06-12 · unverdicted · none · ref 17

CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs
cs.CL · 2026-06-01 · unverdicted · none · ref 28

Same question, different history: language, national identity, and credit in large language models
cs.CL · 2026-06-22 · unverdicted · none · ref 28

KG-FairDiff: Knowledge Graph-Guided Prompt Refinement for Demographically Fair Text-to-Image Generation
cs.CV · 2026-05-31 · unverdicted · none · ref 42

Which Institutional Frameworks Do Chatbots Assume? Auditing Jurisdictional Defaults in Multilingual LLMs
cs.CL · 2026-05-29 · conditional · none · ref 16

JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors
cs.CL · 2026-05-26 · unverdicted · none · ref 24

Omissive Bias in Religious Representation: Benchmarking LLM Answers to Everyday Ethical Decision-making
cs.LG · 2026-05-23 · unverdicted · none · ref 22

Defining Cultural Capabilities for AI Evaluation: A Taxonomy Grounded in Intercultural Communication Theory
cs.CL · 2026-05-15 · unverdicted · none · ref 24

Steerable Cultural Preference Optimization of Reward Models
cs.CL · 2026-06-17 · unverdicted · none · ref 8

Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency
cs.CL · 2026-05-21 · unverdicted · none · ref 20 · 2 links
