ExCAM: Explainable Cultural Awareness Metrics

Christoph Leiter; Haiyue Song; Hideki Tanaka; Hour Kaing; Jin Tei; Masao Utiyama; Steffen Eger

arxiv: 2605.29897 · v1 · pith:66FLMAEEnew · submitted 2026-05-28 · 💻 cs.CL

Pith reviewed 2026-06-29 07:40 UTC · model grok-4.3

classification 💻 cs.CL

keywords: explainable cultural awareness · LLM metrics · cultural errors · error detection · synthetic dataset · free text evaluation · cultural fairness

The pith

ExCAM is the first metric to identify, rate, and explain cultural errors in LLM instruction-output pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents ExCAM, a metric for evaluating the cultural awareness of large language models in free-text generation. Building cultural benchmarks is expensive because it depends on human annotation, and explainable metrics for free text are rare. The authors build ExCAM40k by reformatting nine existing benchmarks and augmenting them with synthetic errors, then train ExCAM on it. ExCAM outperforms the baselines, including GPT-5, reaching up to 80% error-detection accuracy on a balanced test set. The result supports more accessible and transparent evaluation of cultural fairness in AI outputs.

Core claim

ExCAM is, to our knowledge, the first dedicated evaluation metric that identifies, rates and explains cultural errors in instruction-output pairs. To train and evaluate ExCAM, we introduce ExCAM40k, a dataset comprised of nine existing benchmarks that we reformat and enhance with synthetic errors. Compared to several baselines, including GPT-5, ExCAM achieves the highest error detection rate with up to 80% accuracy on a balanced test set. Therefore, ExCAM opens the pathway towards fine-grained and explainable cultural evaluation of free text.
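To put the headline number in context: on a balanced test set, half of the pairs contain an error and half do not, so a detector that always gives the same answer scores 50%. The sketch below computes detection accuracy on made-up labels. The function name and the toy data are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def detection_accuracy(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Fraction of pairs whose predicted 'has error' flag matches the gold flag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("empty test set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy balanced test set (hypothetical): 5 pairs with an error, 5 without.
gold = [True] * 5 + [False] * 5
always_error = [True] * 10  # trivial detector that flags everything
detector = [True, True, True, True, False, False, False, False, False, True]

print(detection_accuracy(gold, always_error))  # 0.5
print(detection_accuracy(gold, detector))      # 0.8
```

Read this way, 80% on a balanced set is 30 points above the trivial baseline, not 80 points above zero.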

What carries the argument

ExCAM, the explainable cultural awareness metric for detecting and explaining cultural errors in LLM outputs.

If this is right

Enables fine-grained evaluation of cultural awareness beyond question answering.
Provides explanations for detected errors to aid understanding.
Lowers the barrier to creating cultural evaluation benchmarks.
Supports generalizability of LLM applications across cultures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method of augmenting benchmarks with synthetic errors could be used for other types of model evaluation.
ExCAM might be adapted to evaluate cultural awareness in languages other than those in the dataset.
Future work could test if using ExCAM during training improves model performance on cultural tasks.

Load-bearing premise

The synthetic errors added to create the ExCAM40k dataset from existing benchmarks accurately represent real cultural errors that LLMs make in free text generation tasks.

What would settle it

A study that applies ExCAM to a collection of real-world LLM outputs containing human-verified cultural errors and measures its detection accuracy would falsify the result if that accuracy came out substantially lower than 80%.
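One possible way to make "substantially lower" operational is to put a confidence interval around the accuracy observed on the human-verified sample and ask whether 80% lies above it. The sketch below uses a Wilson score interval. The sample size, the counts, and the decision rule are assumptions chosen for illustration; the paper does not prescribe them.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def contradicts_reported(correct: int, total: int, reported: float = 0.80) -> bool:
    """True if the reported accuracy lies above the interval's upper bound."""
    _, upper = wilson_interval(correct, total)
    return upper < reported


# Hypothetical outcomes on 500 human-verified real outputs.
print(contradicts_reported(310, 500))  # 62% observed
print(contradicts_reported(390, 500))  # 78% observed
```

Under these assumed numbers, 62% observed would contradict the reported figure while 78% would not, since 80% still falls inside the interval.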

Figures

Figures reproduced from arXiv: 2605.29897 by Christoph Leiter, Haiyue Song, Hideki Tanaka, Hour Kaing, Jin Tei, Masao Utiyama, Steffen Eger.

**Figure 1.** Evaluating the cultural awareness of an instruction and its generated output text with ExCAM. The error report contains details on every detected error. Left: a free-text example with the cultural overgeneralization that no Germans like Schnitzel. Right: an impersonation example modified from GlobalOpinionQA (Durmus et al., 2023). In Malaysia, the most common answer to the instruction is “Not too well” …

**Figure 2.** Example of the ExCAM data creation pipeline (also see §3). (1) We collect benchmarks for cultural awareness. In the figure, we show examples of 4 datasets with unaligned data formats. (2) We load each dataset with an aligned structure, mapping each data sample into an instruction-output pair and wrapping it as an input prompt for ExCAM. This is our ground truth data, because it is sourced from existing ver…

**Figure 3.** Out-of-domain LoRA performance of ExCAM compared to the …

**Figure 4.** Heatmap of in-domain ExCAM compared with GPT-5. The columns show the test set that the models are …

**Figure 5.** A screenshot of the instructions displayed in our annotation interface. (Part 1)

**Figure 6.** A screenshot of the instructions displayed in our annotation interface. (Part 2)

**Figure 7.** A screenshot of our annotation interface. (Step 1)

**Figure 8.** A screenshot of our annotation interface. (Step 2)

**Figure 9.** A screenshot of our annotation interface. (Step 3)

**Figure 10.** Scaled accuracy across 5 languages

**Figure 11.** Scaled accuracy of leave-one-out trained LoRAs on all sub datasets.
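The alignment step described in the Figure 2 caption (heterogeneous benchmark records mapped to instruction-output pairs, then wrapped as a prompt) can be sketched roughly as follows. The record layouts, field names, converter functions, and prompt wording are all hypothetical; the paper's actual schema and template are not reproduced on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """One aligned sample: an instruction and the text generated for it."""
    instruction: str
    output: str
    source: str  # name of the benchmark the sample came from


def from_qa(sample: dict, source: str) -> Pair:
    # Hypothetical QA-style record: {"question": ..., "answer": ...}
    return Pair(sample["question"], sample["answer"], source)


def from_survey(sample: dict, source: str) -> Pair:
    # Hypothetical survey-style record with a country and its most common answer.
    instruction = (
        f"Answer as a typical respondent from {sample['country']}: "
        f"{sample['question']}"
    )
    return Pair(instruction, sample["most_common_answer"], source)


def wrap_prompt(pair: Pair) -> str:
    # Illustrative wrapper only; not the prompt template used by the authors.
    return (
        f"Instruction:\n{pair.instruction}\n\n"
        f"Generated text:\n{pair.output}\n\n"
        "Identify, rate and explain any cultural errors."
    )


pairs = [
    from_qa({"question": "<question text>", "answer": "<reference answer>"}, "qa_benchmark"),
    from_survey(
        {"country": "Malaysia", "question": "<survey question>",
         "most_common_answer": "Not too well"},
        "GlobalOpinionQA",
    ),
]
for pair in pairs:
    print(wrap_prompt(pair))
    print("-" * 40)
```

Keeping the `source` field on every pair is a design choice of this sketch: it makes per-benchmark evaluation and held-out splits straightforward later on.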

Original abstract

Evaluating the cultural awareness of large language models is crucial to ensure the fairness of generated text and the generalizability of applications across the world. Recent benchmarks explore cultural goods like food or values like behavior in stressful situations through the lens of question answering or text generation tasks. However, creating these benchmarks requires time-intensive and costly human annotations. Also, benchmarks that evaluate cultural awareness in free text are scarce and often rely on dated evaluation mechanisms. To address this gap, we introduce ExCAM, an Explainable Cultural Awareness Metric, which is, to our knowledge, the first dedicated evaluation metric that identifies, rates and explains cultural errors in instruction-output pairs. To train and evaluate ExCAM, we introduce ExCAM40k, a dataset comprised of nine existing benchmarks that we reformat and enhance with synthetic errors. Compared to several baselines, including GPT-5, ExCAM achieves the highest error detection rate with up to 80% accuracy on a balanced test set. Therefore, ExCAM opens the pathway towards fine-grained and explainable cultural evaluation of free text.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ExCAM's 80% accuracy is measured only on synthetic errors whose realism for actual LLM free-text mistakes is untested.

ExCAM claims to be the first dedicated explainable metric for cultural errors in free-text LLM outputs. The authors build ExCAM40k by reformatting nine existing benchmarks into instruction-output pairs and injecting synthetic errors, then report up to 80% error detection accuracy on a balanced test set, beating several baselines including GPT-5.

The useful piece is the focus on free text plus explainability. Most cultural benchmarks stay in QA format, and the paper correctly notes the cost of human annotation, so an automated metric that also explains errors could be practical if it works.

The bigger problems are the evaluation setup and missing details. All results sit on the same synthetic-error data, with no human validation that the injected errors match the cultural mistakes models actually produce in unconstrained generation. The abstract supplies no information on model architecture, training procedure, error taxonomy, baseline implementations, or statistical tests, so the performance number cannot be assessed. This leaves open the possibility that the model is mainly learning the synthetic construction process rather than cultural patterns.

The work is aimed at people building or auditing culturally aware LLMs who need scalable evaluation tools. A reader in that area could borrow the dataset construction idea, but would treat the accuracy claim as preliminary until the synthetic-to-real gap is checked.

I would send it for peer review so referees can look at the methods section and any added validation experiments rather than desk-rejecting it on the abstract alone.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ExCAM, an explainable metric for identifying, rating, and explaining cultural errors in LLM instruction-output pairs. It constructs the ExCAM40k dataset from nine existing benchmarks by reformatting them and augmenting with synthetic errors, then reports that ExCAM achieves the highest error detection rate with up to 80% accuracy on a balanced test set, outperforming baselines including GPT-5.

Significance. If the synthetic errors prove representative of real LLM cultural errors in free-text generation, ExCAM would offer a scalable, fine-grained alternative to human-annotated benchmarks for cultural awareness evaluation, addressing the noted scarcity of free-text metrics.

major comments (2)

[Abstract] Abstract: the central claim of up to 80% accuracy supplies no information on training procedure, model architecture, error types, baseline implementations, or statistical significance, rendering the performance result unevaluable.
[Dataset Construction (ExCAM40k)] Dataset construction: training and testing both use the same synthetic-error-augmented ExCAM40k; without human validation that the injected errors match the distribution of cultural errors LLMs produce in unconstrained free-text generation, the reported accuracy cannot be interpreted as evidence of genuine generalization rather than fitting to the augmentation process. The reformatting step from QA-style benchmarks to free-text pairs introduces an additional untested distributional shift.

minor comments (1)

The abstract states that nine existing benchmarks are used but neither names them nor cites the original sources.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate where revisions will be made to improve clarity and acknowledge limitations.

Referee: [Abstract] Abstract: the central claim of up to 80% accuracy supplies no information on training procedure, model architecture, error types, baseline implementations, or statistical significance, rendering the performance result unevaluable.

Authors: We agree that the abstract is too concise and omits key methodological details needed to evaluate the 80% accuracy claim. In the revised manuscript, we will expand the abstract to briefly describe the ExCAM model architecture and training procedure on ExCAM40k, the categories of synthetic cultural errors, the specific baselines including GPT-5, and any statistical significance results. revision: yes

Referee: [Dataset Construction (ExCAM40k)] Dataset construction: training and testing both use the same synthetic-error-augmented ExCAM40k; without human validation that the injected errors match the distribution of cultural errors LLMs produce in unconstrained free-text generation, the reported accuracy cannot be interpreted as evidence of genuine generalization rather than fitting to the augmentation process. The reformatting step from QA-style benchmarks to free-text pairs introduces an additional untested distributional shift.

Authors: We acknowledge this limitation. ExCAM40k is built by reformatting nine existing benchmarks and injecting synthetic errors to enable scalable training and evaluation, as large-scale human annotation of real free-text cultural errors is resource-intensive. Training and testing occur on this augmented dataset to measure detection of the defined error types. However, we agree that this setup does not constitute direct evidence of generalization to real LLM free-text outputs or fully account for reformatting shifts. We will revise the paper to add an explicit limitations section discussing these points, clarify that results are specific to the synthetic test set, and outline future work involving human validation on real generations. revision: partial

Circularity Check

1 step flagged

ExCAM accuracy reported on test set built by identical synthetic-error construction as training data

specific steps

fitted input called prediction [Abstract]
"To train and evaluate ExCAM, we introduce ExCAM40k, a dataset comprised of nine existing benchmarks that we reformat and enhance with synthetic errors. Compared to several baselines, including GPT-5, ExCAM achieves the highest error detection rate with up to 80% accuracy on a balanced test set."

The accuracy figure is measured on a balanced test set drawn from ExCAM40k. Because ExCAM40k is created by the same reformatting + synthetic error injection process for both training and testing portions, the reported performance reduces to how well the model fits the authors' specific synthetic error distribution rather than demonstrating capability on independently occurring cultural errors in unconstrained text.

full rationale

The paper trains and evaluates ExCAM exclusively on splits of ExCAM40k, whose construction (reformatting existing benchmarks + synthetic error injection) is the same for train and test. The headline 80% accuracy is therefore performance on data generated by the identical process used to create the training distribution. This matches the fitted_input_called_prediction pattern: the reported result is a measure of fit to the authors' synthetic construction rather than an independent test of generalization to real cultural errors in free-text generation. No other circularity patterns (self-definition, self-citation load-bearing, etc.) are present in the provided text.
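A partial check on this concern is to hold out one source benchmark at a time, so that the test benchmark is never seen in training. The Figure 11 caption mentions leave-one-out trained LoRAs, which suggests the authors run something of this kind; the sketch below is not their code, and the sample layout and source names are placeholders.

```python
from collections import defaultdict
from typing import Iterator


def leave_one_source_out(
    samples: list[dict],
) -> Iterator[tuple[str, list[dict], list[dict]]]:
    """Yield (held_out_source, train, test), holding out one benchmark at a time."""
    by_source: dict[str, list[dict]] = defaultdict(list)
    for sample in samples:
        by_source[sample["source"]].append(sample)
    for held_out in sorted(by_source):
        train = [
            s for source, items in by_source.items() if source != held_out for s in items
        ]
        yield held_out, train, list(by_source[held_out])


# Toy data: the source names are placeholders, not the paper's nine benchmarks.
samples = [
    {"source": "bench_a", "text": "a1"},
    {"source": "bench_a", "text": "a2"},
    {"source": "bench_b", "text": "b1"},
    {"source": "bench_c", "text": "c1"},
]
for held_out, train, test in leave_one_source_out(samples):
    print(held_out, len(train), len(test))
```

Note the limit of this check: a held-out benchmark still receives its errors from the same injection procedure, so the split tests transfer across benchmarks but not the gap between synthetic and naturally occurring errors.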

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

18 extracted references · 7 canonical work pages · 2 internal anchors

[1]

https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501. [Footnote 9] We partly used AI assistants for coding and writing assistance. Shane Arora, Marzena Karpinska, Hung-Ting Chen, Ipsita Bhattacharjee, Mohit Iyyer, and Eunsol Choi
[2]

CaLMQA: Exploring culturally specific long-form question answering across 23 languages. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11772–11817, Vienna, Austria. Association for Computational Linguistics. Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao
[3]

Evaluation of text generation: A survey. Preprint, arXiv:2006.14799. Yanran Chen and Steffen Eger. 2023. MENLI: Robust evaluation metrics from natural language inference. Transactions of the Association for Computational Linguistics, 11:804–825. Yu Ying Chiu, Liwei Jiang, Bill Yuchen Lin, Chan Young Park, Shuyue Stella Li, Sahithya Ravi, Mehar Bhatia, Mar...
[4]

Towards Measuring the Representation of Subjective Global Opinions in Language Models

Ties matter: Meta-evaluating modern metrics with pairwise accuracy and tie calibration. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12914–12929, Singapore. Association for Computational Linguistics. Esin Durmus, Karina Nguyen, Thomas I. Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Ch...
[5]

EPIC: Multi-perspective annotation of a corpus of irony. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13844–13857, Toronto, Canada. Association for Computational Linguistics. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, ...
[6]

Towards explainable evaluation metrics for machine translation. Journal of Machine Learning Research, 25(75):1–49. Christoph Leiter, Juri Opitz, Daniel Deutsch, Yang Gao, Rotem Dror, and Steffen Eger. 2023. The eval4nlp 2023 shared task on prompting large language models as explainable metrics. Preprint, arXiv:2310.19792. Haitao Li, Qian Dong, Junjie Chen, ...
[7]

LLMs-as-Judges: A comprehensive survey on LLM-based evaluation methods. Preprint, arXiv:2412.05579. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics. Chen Cecilia Liu, Iryna Gurevych, and Anna Korhonen
[8]

Culturally Aware and Adapted NLP: A Taxonomy and a Survey of the State of the Art. arXiv preprint arXiv:2406.03930 [cs]. Arle Lommel, Aljoscha Burchardt, and Hans Uszkoreit. 2014. Multidimensional quality metrics (MQM): A framework for declaring and describing translation quality metrics. Tradumàtica: tecnologies de la traducció, 0:455–463. Microsoft...
[9]

LLM evaluators recognize and favor their own generations. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Lingu...
[10]

An evaluation of cultural value alignment in LLM. Preprint, arXiv:2504.08863. Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Eti...
[11]

WorldCuisines: A massive-scale benchmark for multilingual and multicultural visual question answering on global cuisines. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3242–3264, Albuquerque, New Mexico. Association ...
[12]

INSTRUCTSCORE: Towards explainable text generation evaluation with automatic feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5967–5994, Singapore. Association for Computational Linguistics. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang...
[13]

Qwen3 technical report. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations. Raoyuan Zhao, Beiduo Chen, Barbara Plank, and Michael A. Hedderich. 2025. MAKIEval: A multilingual automatic WiKidata-based framework for cu...
[14]

Introduce believable, difficult errors that require cultural understanding to identify
[15]

Modify content pointwise instead of appending something to it
[16]

Ensure that the modified texts have the same structure, country, length, language, ethnicity and culture as the original texts
[17]

There is no need to modify both texts, you can choose to only modify the instruction or the generated text, but please make sure to introduce cultural errors in at least one of them
[18]

Error Type

Some examples for cultural errors include: misrepresenting cultural values, stereotyping (e.g., assuming all members of a culture share the same beliefs), providing incorrect information about traditions (like festivals or rituals) and goods (like clothing or food), and showing a lack of understanding of cultural norms. 6, {MINOR/MAJOR} Now add your cultu...

