CultureScore: Evaluating Cultural Faithfulness in Video Generation Models

Anku Rani; Mahdi M. Kalayeh; Pattie Maes; Paul Pu Liang; Shravan Nayak; Wei Dai

arxiv: 2606.07311 · v2 · pith:C7NZXDKFnew · submitted 2026-06-05 · 💻 cs.CV · cs.AI

CultureScore: Evaluating Cultural Faithfulness in Video Generation Models

Anku Rani , Wei Dai , Shravan Nayak , Pattie Maes , Mahdi M. Kalayeh , Paul Pu Liang This is my paper

Pith reviewed 2026-06-27 22:26 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords video generationcultural faithfulnessCultureScoreevaluation frameworkmultimodal modelsbehavioral normsglobal culturesAI fairness

0 comments

The pith

No current video generation model achieves culturally faithful outputs, topping out at 56.8 percent on CultureScore.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CultureScore to assess whether video models accurately represent cultures from different parts of the world. It splits the assessment into three parts: who is shown in the video, the cultural setting around them, and the gestures or actions they perform. Tests on three leading models using prompts tied to ten countries produce thousands of videos and show that even the strongest model falls well short, especially on actions and interactions. Human viewers' preferences line up with these cultural scores rather than with existing measures of visual quality alone. The framework therefore supplies a concrete way to track progress toward more equitable video generation.

Core claim

CultureScore decomposes cultural faithfulness into Identity, Context, and Behavior and applies the framework to 6,174 videos generated by three state-of-the-art models across an evaluation suite of ten countries, revealing that the highest overall score is 56.8 percent while Behavior remains below 52.1 percent for every model; human preference rankings match CultureScore directionally yet invert relative to VideoScore rankings.

What carries the argument

CultureScore, the compositional framework that evaluates cultural faithfulness along the three dimensions of Identity, Context, and Behavior using a fixed 10-country prompt suite.

If this is right

Video generation pipelines must improve their handling of normative gestures and social interactions to raise CultureScore.
Benchmarks for these models should combine cultural faithfulness measures with visual quality scores rather than relying on the latter alone.
Human preference data already indicate that cultural accuracy affects perceived quality more than current visual metrics capture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The low Behavior scores may trace back to under-representation of diverse interaction patterns in training data.
Expanding the country set or adding new dimensions could reveal whether the current gaps generalize or are specific to the tested suite.

Load-bearing premise

The three dimensions of Identity, Context, and Behavior together with the 10-country prompt set give a valid and unbiased measure of cultural faithfulness.

What would settle it

A new model that scores above 70 percent overall on CultureScore, or a human study in which annotators consistently prefer the highest-VideoScore model over the highest-CultureScore model, would falsify the central claims.

Figures

Figures reproduced from arXiv: 2606.07311 by Anku Rani, Mahdi M. Kalayeh, Pattie Maes, Paul Pu Liang, Shravan Nayak, Wei Dai.

**Figure 1.** Figure 1: CultureScore is a new compositional evaluation framework that decomposes cultural faithfulness into [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: The CultureScore evaluation framework. Base prompts are decomposed into Identity, Behavior, and [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: The Base Prompt (left) includes an explicit geographic identifier (“Muslim Indian family enjoying a halal feast at home”), producing a culturally grounded scene. The Extended Prompt (center) augments the scene with decomposed Identity, Behavior, and Context descriptions, yielding richer cultural detail, such as Islamic decor, white kurta attire, etc. The ´ Geographical Constraint Removed Prompt (right) str… view at source ↗

**Figure 4.** Figure 4: Average CultureScore (%) across Identity, [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: The inverse relationship between the aver [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Average CultureScore (%) for base and extended prompt across the models. Extended prompts consistently [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Average CultureScore of LTX-2 and Veo 3.1 [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: Correlation between CultureScore and native human preference rankings, across all nine countries. [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: Heatmaps showing similarity between countries across different cultural categories. [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 10.** Figure 10: Annotation Platform-Humans annotated samples based on Question Relevance, Answer Accuracy, and [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

read the original abstract

As video generation models like Veo 3.1 and LTX-2 advance, their ability to accurately represent diverse global cultures remains a critical yet understudied frontier. Current metrics, such as VideoScore, only measure visual quality but offer no mechanism for assessing cultural faithfulness. Consequently, a model that replaces a Namaste with a handshake receives the same score as one that generates the gesture correctly. We propose CultureScore, a compositional evaluation framework that decomposes cultural faithfulness into three granular dimensions: Identity (who is represented), Context (culturally localized background), and Behavior (normative gestures and interactions). We operationalize this framework through an evaluation suite spanning 10 countries, yielding 6,174 generated videos across three state-of-the-art models. Our evaluation reveals that no current model achieves culturally faithful video generation: the best-performing model reaches only 56.8\% overall CultureScore, with Behavior the most challenging dimension, which remains below 52.1\% across all models. Furthermore, human preference rankings align directionally with CultureScore but are inverted relative to VideoScore; the highest-scoring model on visual quality was ranked last by annotators, underscoring that cultural faithfulness is an essential criterion for equitable video generation. Data and code are available at https://huggingface.co/datasets/ankurani/CultureScore.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CultureScore adds a useful three-way breakdown for cultural faithfulness in video gen that VideoScore misses, but the headline numbers on model failure rest on unverified details about prompt sampling and annotator backgrounds.

read the letter

The main contribution is the decomposition of cultural faithfulness into Identity, Context, and Behavior, plus a 10-country evaluation set that produces 6,174 videos from three models. This moves past visual-quality-only scores and shows that the best model still only reaches 56.8% overall, with Behavior stuck below 52.1%. The paper also reports that human preference rankings line up with CultureScore but reverse relative to VideoScore, which is a concrete finding worth having.

The work does a few things right. It releases the dataset and code on Hugging Face, which lets others check or extend the suite. The framing is straightforward and the gap versus existing metrics is clear from the abstract.

The soft spots sit in the human study. The stress-test concern about whether the prompts are representative and whether annotators match the 10 countries in background is real; without those details the Behavior and Context scores could reflect annotator priors more than actual cultural norms. Inter-annotator agreement and calibration numbers are also needed to judge how stable the percentages are. If the full paper supplies those controls and shows the sampling process, the claims strengthen; if not, the inversion result stays hard to generalize.

This paper is for people working on evaluation metrics and fairness in generative video. It deserves a serious referee because the problem is timely, the framework is new, and the public data lowers the barrier to follow-up work, even though the human-evaluation section will likely need tightening.

Referee Report

3 major / 2 minor

Summary. The paper introduces CultureScore, a compositional framework that decomposes cultural faithfulness in video generation into three dimensions (Identity, Context, Behavior) and applies it to an evaluation suite of prompts spanning 10 countries. It generates 6,174 videos from three state-of-the-art models, reports that the best model reaches only 56.8% overall CultureScore (with Behavior below 52.1% for all models), and shows that CultureScore rankings align with human preferences while inverting those from VideoScore.

Significance. If the evaluation suite and annotation protocol are shown to be representative and unbiased, the work identifies a clear gap in current video models' ability to handle cultural content and demonstrates that visual-quality metrics alone are insufficient. The public release of data and code strengthens the contribution by enabling direct replication and extension.

major comments (3)

[§3.2] §3.2 (Evaluation Suite): The paper does not detail the sampling procedure used to construct the 6,174 prompts across the 10 countries (e.g., stratification by topic, frequency of cultural elements, or exclusion criteria), so it is impossible to determine whether the low scores reflect genuine model shortcomings or selection bias in the prompt set.
[§4.2] §4.2 (Annotation Protocol): No information is provided on annotator recruitment, cultural origin, or expertise matching the 10 countries; without this, scores on Behavior and Context may simply capture Western annotator priors rather than culturally grounded judgments.
[§5] §5 (Results): Inter-annotator agreement statistics (e.g., Fleiss' kappa per dimension) and any calibration or bias-mitigation procedures are absent, so the headline figures (56.8% overall, <52.1% Behavior) cannot be assessed for reliability or used to support the claim that no model achieves cultural faithfulness.

minor comments (2)

[Abstract] The abstract states that human preference rankings align directionally with CultureScore; the corresponding table or figure should explicitly report the rank correlation coefficient.
[§3.1] Notation for the three dimensions is introduced without a concise formal definition (e.g., a short equation or decision tree) that annotators could reference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on improving the transparency and rigor of our evaluation methodology. We address each major comment below and commit to revisions that enhance the manuscript without altering our core findings.

read point-by-point responses

Referee: [§3.2] §3.2 (Evaluation Suite): The paper does not detail the sampling procedure used to construct the 6,174 prompts across the 10 countries (e.g., stratification by topic, frequency of cultural elements, or exclusion criteria), so it is impossible to determine whether the low scores reflect genuine model shortcomings or selection bias in the prompt set.

Authors: We agree that explicit details on prompt construction are required to assess representativeness. In the revised manuscript, we will expand §3.2 with a full description of the sampling procedure, including stratification by country and topic, selection of cultural elements, and exclusion criteria. This addition will directly address concerns about potential selection bias. revision: yes
Referee: [§4.2] §4.2 (Annotation Protocol): No information is provided on annotator recruitment, cultural origin, or expertise matching the 10 countries; without this, scores on Behavior and Context may simply capture Western annotator priors rather than culturally grounded judgments.

Authors: We acknowledge that annotator background information is essential for validating cultural judgments. We will revise §4.2 to include details on annotator recruitment, cultural origins, and relevant expertise for the 10 countries, along with any procedures used to ensure alignment with local cultural contexts. This will clarify that judgments were not solely based on external priors. revision: yes
Referee: [§5] §5 (Results): Inter-annotator agreement statistics (e.g., Fleiss' kappa per dimension) and any calibration or bias-mitigation procedures are absent, so the headline figures (56.8% overall, <52.1% Behavior) cannot be assessed for reliability or used to support the claim that no model achieves cultural faithfulness.

Authors: We agree that inter-annotator agreement and bias mitigation details are necessary to support the reliability of the reported scores. In the revised manuscript, we will report Fleiss' kappa (or equivalent) per dimension in §5, along with descriptions of calibration procedures and bias-mitigation steps. These additions will allow readers to evaluate the robustness of the 56.8% and <52.1% figures. revision: yes

Circularity Check

0 steps flagged

No circularity: CultureScore is an independently defined metric applied to external model outputs.

full rationale

The paper defines CultureScore as a new compositional framework with three dimensions (Identity, Context, Behavior) and an evaluation suite of 6,174 videos from three external models across 10 countries. No equations, derivations, fitted parameters, or predictions appear. The central results (e.g., max 56.8% CultureScore) are direct outputs of applying the defined metric to model generations, with no reduction to self-citation chains or input-by-construction. This matches the reader's assessment of score 2.0 and contains no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no free parameters, axioms, or invented entities; the framework is presented as a direct decomposition without additional fitted quantities or postulated constructs.

pith-pipeline@v0.9.1-grok · 5786 in / 1096 out tokens · 25849 ms · 2026-06-27T22:26:17.697478+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 3 canonical work pages · 3 internal anchors

[1]

InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8734–8743

Grid diffusion models for text-to-video gener- ation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8734–8743. Sina Malakouti, Boqing Gong, and Adriana Kovashka
[2]

Scalable Diffusion Models with Transformers

Culture in action: Evaluating text-to-image models through social activities. InThe Fourteenth International Conference on Learning Representa- tions. Shravan Nayak, Mehar Bhatia, Xiaofeng Zhang, Verena Rieser, Lisa Anne Hendricks, Sjoerd Van Steenkiste, Yash Goyal, Karolina Stanczak, and Aishwarya Agrawal. 2025. CulturalFrames: Assessing cultural expecta...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Towards Accurate Generative Models of Video: A New Metric & Challenges

VF-eval: Evaluating multimodal LLMs for generating feedback on AIGC videos. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21126–21146, Vienna, Austria. Association for Computational Linguistics. Charles Spearman. 1961. The proof and measurement of association between two things. Th...

work page internal anchor Pith review Pith/arXiv arXiv 1961
[4]

Wan: Open and Advanced Large-Scale Video Generative Models

Wan: Open and advanced large-scale video generative models.Preprint, arXiv:2503.20314. Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, and Nong Sang. 2024. A recipe for scaling up text-to-video generation with text-free videos. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Yes/No Format:Every question must be structured so that a ”Yes” indicates cultural faithfulness and a ”No” indicates a failure or cultural inaccuracy
[6]

Do not assume the evaluating model has implicit cultural knowledge

Embedded Visual Descriptions:Embed precise visual, physical, or spatial descriptions directly into the question. Do not assume the evaluating model has implicit cultural knowledge. (e.g., Instead of ”Is the person wearing a traditional Kimono?”, ask ”Is the person wearing a traditional Kimono, characterized by left-over-right wrapped lapels and wide, squa...
[7]

(e.g., ”Does the pouring Behavior begin with the vessel held low, smoothly rise to a higher elevation, and return low without breaking the stream?”)

Temporal Grounding (Crucial for Behaviors):Questions MUST explicitly probe the sequence, duration, physics, or progression of the movement across frames. (e.g., ”Does the pouring Behavior begin with the vessel held low, smoothly rise to a higher elevation, and return low without breaking the stream?”)
[8]

grounded

Explicit vs. Implicit:Generate questions for explicitly requested elements, AND mandatory implicit elements required for cultural authenticity. Actively avoid Western-centric stereotypes (e.g., implicitly checking that a traditional daily Context avoids hyper-exoticized or religious backdrops unless the Context or the prompt requires them). 5.Weighting St...
[9]

Identify the visual evidence:Describe exactly what you observe in the video frames—specific Identitys, clothing details, spatial arrangements, architectural elements, lighting, and colors
[10]

Do not rely on implicit cultural knowledge—only evaluate what the question explicitly describes

Assess cultural accuracy:Compare your observations against the culturally specific visual descriptions embedded in the question. Do not rely on implicit cultural knowledge—only evaluate what the question explicitly describes
[11]

Note whether Behaviors follow the temporal grounding specified in the question

Evaluate temporal and physical coherence (for Behavior questions):Examine the sequence, duration, physics, and progression of movements across frames. Note whether Behaviors follow the temporal grounding specified in the question
[12]

culturally unique

Check for stereotyping or inauthenticity:Flag if the video substitutes Western-centric defaults, hyper-exoticized elements, or generic representations in place of the specific cultural markers described in the question. After your reasoning, provide the final answer as either Yes or No. ”Yes” means the video is culturally faithful for what the question as...

2025

[1] [1]

InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8734–8743

Grid diffusion models for text-to-video gener- ation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8734–8743. Sina Malakouti, Boqing Gong, and Adriana Kovashka

[2] [2]

Scalable Diffusion Models with Transformers

Culture in action: Evaluating text-to-image models through social activities. InThe Fourteenth International Conference on Learning Representa- tions. Shravan Nayak, Mehar Bhatia, Xiaofeng Zhang, Verena Rieser, Lisa Anne Hendricks, Sjoerd Van Steenkiste, Yash Goyal, Karolina Stanczak, and Aishwarya Agrawal. 2025. CulturalFrames: Assessing cultural expecta...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Towards Accurate Generative Models of Video: A New Metric & Challenges

VF-eval: Evaluating multimodal LLMs for generating feedback on AIGC videos. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21126–21146, Vienna, Austria. Association for Computational Linguistics. Charles Spearman. 1961. The proof and measurement of association between two things. Th...

work page internal anchor Pith review Pith/arXiv arXiv 1961

[4] [4]

Wan: Open and Advanced Large-Scale Video Generative Models

Wan: Open and advanced large-scale video generative models.Preprint, arXiv:2503.20314. Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, and Nong Sang. 2024. A recipe for scaling up text-to-video generation with text-free videos. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern ...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Yes/No Format:Every question must be structured so that a ”Yes” indicates cultural faithfulness and a ”No” indicates a failure or cultural inaccuracy

[6] [6]

Do not assume the evaluating model has implicit cultural knowledge

Embedded Visual Descriptions:Embed precise visual, physical, or spatial descriptions directly into the question. Do not assume the evaluating model has implicit cultural knowledge. (e.g., Instead of ”Is the person wearing a traditional Kimono?”, ask ”Is the person wearing a traditional Kimono, characterized by left-over-right wrapped lapels and wide, squa...

[7] [7]

(e.g., ”Does the pouring Behavior begin with the vessel held low, smoothly rise to a higher elevation, and return low without breaking the stream?”)

Temporal Grounding (Crucial for Behaviors):Questions MUST explicitly probe the sequence, duration, physics, or progression of the movement across frames. (e.g., ”Does the pouring Behavior begin with the vessel held low, smoothly rise to a higher elevation, and return low without breaking the stream?”)

[8] [8]

grounded

Explicit vs. Implicit:Generate questions for explicitly requested elements, AND mandatory implicit elements required for cultural authenticity. Actively avoid Western-centric stereotypes (e.g., implicitly checking that a traditional daily Context avoids hyper-exoticized or religious backdrops unless the Context or the prompt requires them). 5.Weighting St...

[9] [9]

Identify the visual evidence:Describe exactly what you observe in the video frames—specific Identitys, clothing details, spatial arrangements, architectural elements, lighting, and colors

[10] [10]

Do not rely on implicit cultural knowledge—only evaluate what the question explicitly describes

Assess cultural accuracy:Compare your observations against the culturally specific visual descriptions embedded in the question. Do not rely on implicit cultural knowledge—only evaluate what the question explicitly describes

[11] [11]

Note whether Behaviors follow the temporal grounding specified in the question

Evaluate temporal and physical coherence (for Behavior questions):Examine the sequence, duration, physics, and progression of movements across frames. Note whether Behaviors follow the temporal grounding specified in the question

[12] [12]

culturally unique

Check for stereotyping or inauthenticity:Flag if the video substitutes Western-centric defaults, hyper-exoticized elements, or generic representations in place of the specific cultural markers described in the question. After your reasoning, provide the final answer as either Yes or No. ”Yes” means the video is culturally faithful for what the question as...

2025