C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment
Pith reviewed 2026-05-10 09:45 UTC · model grok-4.3
The pith
Geometric misalignment across languages in embedding spaces serves as a signal to automatically mine high-fidelity cultural seeds from raw text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cultural concepts display measurable cross-lingual misalignment inside pre-trained embedding spaces, which can be used to identify and extract high-fidelity Culture Points from raw multilingual corpora without human or LLM supervision. These points then guide the synthesis of diverse instruction-tuning data, yielding improved cultural understanding and reasoning in downstream models.
What carries the argument
Cross-lingual geometric misalignment in pre-trained embedding spaces, used as the discovery signal to locate regions of linguistic exclusivity and isolation.
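Read mechanically, the signal is a distance test: a concept that translates cleanly should land near its translations in a shared multilingual embedding space, while a culture-bound term should not. A minimal sketch of such a score, assuming a generic multilingual encoder — the toy vectors below stand in for real embeddings, and the exact metric, thresholds, and model are not specified in the abstract:

```python
import numpy as np

def cosine(u, v):
    # Standard cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def misalignment(term_vec, translation_vecs):
    """Illustrative signal: 1 minus the mean cosine similarity between a
    term's embedding and the embeddings of its translations. Higher values
    mean the concept occupies different regions across languages."""
    sims = [cosine(term_vec, t) for t in translation_vecs]
    return 1.0 - float(np.mean(sims))

# Toy vectors standing in for a multilingual encoder's output.
# A universal concept ("bread") aligns closely with its translations ...
bread_en = np.array([0.9, 0.1, 0.0])
bread_fr = np.array([0.88, 0.12, 0.01])
bread_de = np.array([0.91, 0.09, 0.02])
# ... while a culture-bound term drifts toward a language-specific region.
hygge_da = np.array([0.2, 0.3, 0.9])
hygge_en = np.array([0.7, 0.4, 0.2])  # loose English paraphrase

print(misalignment(bread_en, [bread_fr, bread_de]))  # close to 0
print(misalignment(hygge_da, [hygge_en]))            # clearly larger
```

Under this reading, mining reduces to ranking corpus terms by such a score and keeping the isolated tail — which is exactly why the validity of the proxy, not the arithmetic, is the load-bearing question.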
If this is right
- Cultural data preparation costs fall more than 150-fold relative to manual or LLM-assisted curation.
- Models using the synthesized data gain over 6 points on hard cultural reasoning benchmarks.
- The method exceeds prior supervised baselines in cultural alignment performance.
- Seed discovery becomes a fully automatic, quantifiable process that works on any raw multilingual corpus.
Where Pith is reading between the lines
- Similar misalignment metrics could be tested on non-cultural domains such as technical terminology to locate domain-specific seeds automatically.
- Controlled experiments on corpora with deliberately balanced cultural coverage would clarify whether isolation reflects genuine exclusivity or data artifacts.
- Inserting the mining step upstream in existing synthesis pipelines could lower dependence on curated or proprietary sources for alignment data.
Load-bearing premise
Pronounced linguistic exclusivity and geometric isolation in embedding spaces reliably mark culturally specific concepts rather than artifacts of training data imbalance or embedding model biases.
What would settle it
Applying the extraction process to a multilingual corpus of neutral, non-cultural topics and finding that it still selects large numbers of items as Culture Points, or observing no gain on cultural benchmarks when the resulting seeds are used for data synthesis.
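That settling criterion can be framed as a simple measurement: run the same extractor on a deliberately neutral corpus and compare its hit rate against the cultural corpus. The harness below is hypothetical — the 5x margin and the decision labels are illustrative choices, not anything the paper proposes:

```python
def falsification_check(rate_cultural, rate_neutral, margin=5.0):
    """Sketch of the control experiment: if the miner flags 'Culture
    Points' in a neutral corpus at a rate comparable to a cultural one,
    the geometric signal likely tracks corpus artifacts rather than
    cultural specificity. `margin` is how many times higher the cultural
    hit rate must be for the premise to survive (an assumed threshold)."""
    if rate_neutral == 0:
        return "signal survives"
    ratio = rate_cultural / rate_neutral
    return "signal survives" if ratio >= margin else "signal suspect"

# e.g. the miner flags 12% of terms in a cultural corpus but 9% in a
# balanced technical one: far below a 5x margin, so the premise fails.
print(falsification_check(0.12, 0.09))  # → signal suspect
```

The second arm of the test — no benchmark gain when the mined seeds are used for synthesis — would require the full training pipeline and is not sketchable this compactly.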
Original abstract
Achieving cultural alignment in Large Language Models (LLMs) increasingly depends on synthetic data generation. For such synthesis, the most vital initial step is seed curation; however, current methods lack quantifiable standards for selecting these seeds. Existing approaches rely on unscalable manual curation or bias-prone LLM extraction, treating cultural specificity as an abstract concept rather than a measurable signal. In this paper, we address this "quantification gap" by proposing C-Mining, an unsupervised framework that transforms the discovery of cultural seeds from a subjective selection process into a computable data mining formulation. Our approach exploits a novel geometric insight, leveraging the cross-lingual misalignment of cultural concepts within pre-trained embedding spaces as a quantifiable discovery signal. By systematically identifying these regions characterized by pronounced linguistic exclusivity and geometric isolation, while actively filtering out noise, C-Mining automatically extracts high-fidelity Culture Points (CPs) from raw multilingual corpora without reliance on human or LLM supervision, reducing preparation costs by more than 150-fold. We further leverage the mined knowledge to steer the synthesis of diverse instruction-tuning datasets. Extensive experiments demonstrate that this seed-centric approach significantly enhances cultural understanding and reasoning capabilities, achieving a +6.03 point improvement on CulturalBench-Hard and surpassing state-of-the-art baselines, providing a scalable, quantifiable solution for high-quality cultural data synthesis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces C-Mining, an unsupervised framework for discovering cultural seeds termed Culture Points (CPs) from raw multilingual corpora. It exploits cross-lingual geometric misalignment and linguistic exclusivity in pre-trained embedding spaces as a discovery signal, applies noise filtering, and uses the resulting CPs to steer synthetic instruction-tuning data generation. The paper claims this eliminates human or LLM supervision in seed curation, reduces preparation costs by more than 150-fold, and delivers a +6.03 point gain on CulturalBench-Hard while surpassing state-of-the-art baselines.
Significance. If the geometric signal can be shown to identify culturally specific concepts rather than embedding artifacts, the work would offer a scalable, objective alternative to manual or LLM-assisted seed curation for cultural alignment in LLMs. This could meaningfully lower barriers to high-quality synthetic cultural data and encourage similar unsupervised mining approaches in related alignment tasks.
major comments (2)
- [Abstract] Abstract: The central claim that geometric misalignment and linguistic exclusivity reliably mark 'high-fidelity' cultural concepts is load-bearing, yet the abstract provides no external validation (human annotation of extracted CPs, comparison against frequency-matched non-cultural terms, or controlled ablation of embedding biases). Without such grounding, the method risks identifying corpus imbalance artifacts rather than cultural specificity, undermining the unsupervised discovery premise.
- [Abstract] Abstract: The reported +6.03 point improvement on CulturalBench-Hard and 150-fold cost reduction are presented without specifying the baseline synthesis pipelines, the exact filtering steps applied after geometric identification, or the composition and difficulty distribution of CulturalBench-Hard. These omissions prevent assessment of whether gains are attributable to the mined CPs or to other uncontrolled factors in the data synthesis stage.
minor comments (1)
- [Abstract] Abstract: The acronym 'CPs' for Culture Points is introduced without a concise illustrative example of what constitutes a mined point; adding one sentence with a concrete multilingual example would improve immediate clarity for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment point by point below, providing clarifications based on the manuscript content and indicating planned revisions to improve the abstract's completeness.
Point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that geometric misalignment and linguistic exclusivity reliably mark 'high-fidelity' cultural concepts is load-bearing, yet the abstract provides no external validation (human annotation of extracted CPs, comparison against frequency-matched non-cultural terms, or controlled ablation of embedding biases). Without such grounding, the method risks identifying corpus imbalance artifacts rather than cultural specificity, undermining the unsupervised discovery premise.
Authors: We acknowledge that the abstract, due to its brevity, does not explicitly reference the validation experiments. The manuscript grounds the claim through human annotation of sampled CPs for cultural fidelity (Section 4.2), direct comparisons to frequency-matched non-cultural terms showing the signal exceeds frequency effects (Section 5.1), and ablations isolating embedding biases (Section 5.3). These demonstrate the geometric misalignment identifies culturally specific concepts rather than artifacts. We will revise the abstract to include a concise reference to these validations. revision: yes
-
Referee: [Abstract] Abstract: The reported +6.03 point improvement on CulturalBench-Hard and 150-fold cost reduction are presented without specifying the baseline synthesis pipelines, the exact filtering steps applied after geometric identification, or the composition and difficulty distribution of CulturalBench-Hard. These omissions prevent assessment of whether gains are attributable to the mined CPs or to other uncontrolled factors in the data synthesis stage.
Authors: We agree the abstract would benefit from greater specificity on these elements. The baselines are standard LLM instruction synthesis pipelines without cultural seeds (detailed in Section 6 and Table 3). Filtering steps consist of linguistic exclusivity thresholding followed by geometric outlier removal (Section 3.4). CulturalBench-Hard comprises 1,200 hard cultural reasoning items across 12 domains, with difficulty distribution reported in Section 4.1. We will update the abstract to concisely specify the baselines, filtering, and benchmark details. revision: yes
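The two filtering stages named in this response can be sketched as a small pipeline. Everything below is a schematic reconstruction from the rebuttal's one-line description: the exclusivity score, the k-nearest-neighbour isolation measure, and both thresholds (`tau_excl`, `tau_iso`) are illustrative placeholders, not the paper's actual formulation or parameters.

```python
import numpy as np

def exclusivity(lang_counts):
    """Fraction of a term's occurrences concentrated in its single most
    frequent language (placeholder for a linguistic-exclusivity score)."""
    total = sum(lang_counts.values())
    return max(lang_counts.values()) / total

def knn_isolation(vec, neighbors, k=3):
    """Mean distance to the k nearest neighbors: large values suggest
    geometric isolation; near-zero values suggest duplicated noise."""
    dists = sorted(np.linalg.norm(vec - n) for n in neighbors)
    return float(np.mean(dists[:k]))

def mine_candidates(terms, tau_excl=0.8, tau_iso=(0.05, 5.0)):
    """Two-stage filter: keep terms that are (1) linguistically exclusive
    and (2) geometrically isolated, but not so far out as to be noise."""
    kept = []
    for term, lang_counts, vec, neighbors in terms:
        if exclusivity(lang_counts) < tau_excl:
            continue  # stage 1: linguistic exclusivity thresholding
        iso = knn_isolation(vec, neighbors)
        if tau_iso[0] <= iso <= tau_iso[1]:  # stage 2: outlier band
            kept.append(term)
    return kept

# Toy corpus: a diagonal cloud of shared concepts plus two probe terms.
cloud = [np.full(4, 0.1 * i) for i in range(20)]
terms = [
    ("hanami", {"ja": 95, "en": 5}, np.full(4, 2.5), cloud),        # exclusive + isolated
    ("internet", {"en": 40, "fr": 35, "ja": 25}, np.zeros(4), cloud),  # shared concept
]
print(mine_candidates(terms))  # → ['hanami']
```

The upper bound of the isolation band is what implements the "actively filtering out noise" clause: a point infinitely far from everything is more likely an encoding artifact than a concept.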
Circularity Check
C-Mining defines Culture Points via the misalignment property presupposed for cultural concepts
specific steps (1)
- self-definitional [Abstract]: "leveraging the cross-lingual misalignment of cultural concepts within pre-trained embedding spaces as a quantifiable discovery signal. By systematically identifying these regions characterized by pronounced linguistic exclusivity and geometric isolation, while actively filtering out noise, C-Mining automatically extracts high-fidelity Culture Points (CPs) from raw multilingual corpora without reliance on human or LLM supervision" — The text assumes cultural concepts are defined by their misalignment and isolation, then defines the mining procedure as locating exactly those properties and labeling the results Culture Points. The output is therefore equivalent to the input assumption by construction, with no independent test of whether the geometric signal corresponds to cultural specificity outside the embedding space.
full rationale
The paper's central derivation begins by positing that cultural concepts exhibit cross-lingual misalignment and geometric isolation in embedding spaces, then operationalizes discovery as the identification of precisely those regions. This reduces the unsupervised extraction step to a restatement of the initial assumption rather than an independent criterion. The subsequent filtering and synthesis steps inherit this equivalence, and no external grounding (such as human-validated cultural labels independent of the embedding geometry) is invoked to break the loop. While benchmark gains are reported, they do not validate the seed-identification premise itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pre-trained multilingual embeddings preserve cross-lingual misalignment as a reliable proxy for cultural specificity.
invented entities (1)
- Culture Points (CPs) — no independent evidence
Reference graph
Works this paper leans on
- [1] Yu Ying Chiu, Liwei Jiang, Bill Yuchen Lin, Chan Young Park, Shuyue Stella Li, Sahithya Ravi, Mehar Bhatia, Maria Antoniak, Yulia Tsvetkov, Vered Shwartz, and Yejin Choi. 2025. CulturalBench: A Robust, Diverse and Challenging Benchmark for Measuring LMs' Cultural Knowledge Through Human-AI Red-Teaming. In Proceedings of the 63rd Annual Meeting of the Asso...
- [2] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-...
- [3] Li Du, Hanyu Zhao, Yiming Ju, and Tengfei Pan. 2025. Scaling Towards the Information Boundary of Instruction Set: InfinityInstruct-Subject Technical Report. CoRR abs/2507.06968 (2025). arXiv:2507.06968 doi:10.48550/ARXIV.2507.06968
- [4] Esin Durmus, Karina Nyugen, Thomas I. Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, Liane Lovitt, Sam McCandlish, Orowa Sikder, Alex Tamkin, Janel Thamkul, Jared Kaplan, Jack Clark, and Deep Ganguli. 2023. Towards Measuring the Representation of Subjective Global Opinions in Lang...
- [5] Julen Etxaniz, Gorka Azkune, Aitor Soroa, Oier Lopez de Lacalle, and Mikel Artetxe. 2024. Do Multilingual Language Models Think Better in English?. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Short Papers, NAACL 2024, Mexico City, Mexico, June 16-21, 2...
- [6] EVS/WVS. 2024. Joint EVS/WVS 2017-2022 Dataset (Joint EVS/WVS). (ZA7505; Version 5.0.0) [Data set]. GESIS, Cologne. doi:10.4232/1.14320
- [7] Ruixiang Feng, Shen Gao, Xiuying Chen, Lisi Chen, and Shuo Shang. 2025. CulFiT: A Fine-grained Cultural-aware LLM Training Paradigm via Multilingual Critique Data Synthesis. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, Wanxiang Che, J...
- [8] Wikimedia Foundation. 2021. Wikipedia dump. https://dumps.wikimedia.org/
- [9] Yi Fung, Ruining Zhao, Jae Doo, Chenkai Sun, and Heng Ji. 2024. Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking. CoRR abs/2402.09369 (2024). arXiv:2402.09369 doi:10.48550/ARXIV.2402.09369
- [10] Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. 2024. Scaling Synthetic Data Creation with 1,000,000,000 Personas. CoRR abs/2406.20094 (2024). arXiv:2406.20094 doi:10.48550/ARXIV.2406.20094
- [11] Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. 2024. ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools. arXiv preprint arXiv:2406.12793 (2024). doi:10.48550/arXiv.2406.12793
- [12] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783 (2024). doi:10.48550/arXiv.2407.21783
- [14] Yanzhu Guo, Simone Conia, Zelin Zhou, Min Li, Saloni Potdar, and Henry Xiao. 2025. Do Large Language Models have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (...
- [16] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, Adriane Boyd, et al. spaCy: Industrial-strength natural language processing in Python. Software available from https://spacy.io/
- [18] Tomás Horych, Christoph Mandl, Terry Ruas, André Greiner-Petter, Bela Gipp, Akiko Aizawa, and Timo Spinde. 2025. The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection. In Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, USA, April 29 - May 4, 2025, Luis Chiru...
- [19] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net. https://openreview.net/forum?id=nZeVKeeFYf9
- [20] Austin C. Kozlowski, Callin Dai, and Andrei Boutyline. 2025. Semantic Structure in Large Language Model Embeddings. CoRR abs/2508.10003 (2025). arXiv:2508.10003 doi:10.48550/ARXIV.2508.10003
- [22] Cheng Li, Mengzhuo Chen, Jindong Wang, Sunayana Sitaram, and Xing Xie. 2024. CultureLLM: Incorporating Cultural Differences into Large Language Models. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, Amir Globersons, Le...
- [23] Cheng Li, Damien Teney, Linyi Yang, Qingsong Wen, Xing Xie, and Jindong Wang. 2024. CulturePark: Boosting Cross-cultural Understanding in Large Language Models. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, Amir Globerson...
- [24] Zheng Wei Lim, Alham Fikri Aji, and Trevor Cohn. 2025. Language-Specific Latent Process Hinders Cross-Lingual Performance. CoRR abs/2505.13141 (2025). arXiv:2505.13141 doi:10.48550/ARXIV.2505.13141
- [25] Yihong Liu, Mingyang Wang, Amir Hossein Kargaran, Felicia Körner, Ercong Nie, Barbara Plank, François Yvon, and Hinrich Schütze. 2025. Tracing Multilingual Factual Knowledge Acquisition in Pretraining. CoRR abs/2505.14824 (2025). arXiv:2505.14824 doi:10.48550/ARXIV.2505.14824
- [26] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. https://openreview.net/forum?id=Bkg6RiCqY7
- [27] Mistral AI. 2025. Ministral-3-8B-Instruct-2512. https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512
- [28] Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M. Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2023. Crosslingual Generalization through Multitask Finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics, 15991–16111. doi:10.18653/V1/202...
- [30] Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Putri, Dimosthenis Antypas, Hsuvas Borkakoty, Eunsu Kim, Carla Perez-Almendros, Abinew Ali Ayele, et al. 2024. BLEnD: A benchmark for LLMs on everyday knowledge in diverse cultures and languages. Advances in Neural Information Processing Systems 37 (2024), 78104–78146. doi:10.52202/079017-2483
- [32] Tuan-Phong Nguyen, Simon Razniewski, Aparna S. Varde, and Gerhard Weikum. 2023. Extracting Cultural Commonsense Knowledge at Scale. In Proceedings of the ACM Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023 - 4 May 2023, Ying Ding, Jie Tang, Juan F. Sequeda, Lora Aroyo, Carlos Castillo, and Geert-Jan Houben (Eds.). ACM, 1907–1917. doi:10.1145/3543507.3583535
- [34] Haris Riaz, Sourav Sanjukta Bhabesh, Vinayak Arannil, Miguel Ballesteros, and Graham Horwood. 2025. MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation. In Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, a...
- [35] Sougata Saha, Saurabh Kumar Pandey, and Monojit Choudhury. 2025. Meta-Cultural Competence: Climbing the Right Hill of Cultural Awareness. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 - Volume 1: Long Papers, Albuquerque, New Mexico, ...
- [36] Weiyan Shi, Ryan Li, Yutong Zhang, Caleb Ziems, Sunny Yu, Raya Horesh, Rogério de Paula, and Diyi Yang. 2024. CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies. In Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 12-16, 2024, Yaser Al-Onaizan, Mohit Ba...
- [37] William Shiao and Evangelos E. Papalexakis. 2024. Synthetic data for learning-based knowledge discovery. SIGKDD Explor. Newsl. 26, 1 (July 2024), 19–23. doi:10.1145/3682112.3682115
- [38] Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David Ifeoluwa Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Sebastian Ruder, Wei-Yin Ko, Antoine Bosselut, Alice Oh, André F. T. Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, Marzieh Fadae... 2025.
- [39] Anders Søgaard, Sebastian Ruder, and Ivan Vulic. 2018. On the Limitations of Unsupervised Bilingual Dictionary Induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, Iryna Gurevych and Yusuke Miyao (Eds.). Association for Computational L...
- [40] Shashank Srivastava. 2025. Large Language Models Threaten Language's Epistemic and Communicative Foundations. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (Eds.). Association for Computational Linguistics, Suzhou, China, 28650–28...
- [41] Ivan Vulic, Edoardo Maria Ponti, Robert Litschko, Goran Glavas, and Anna Korhonen. 2020. Probing Pretrained Language Models for Lexical Semantics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Com...
- [42] Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schütze, and Barbara Plank. 2025. Refusal Direction is Universal Across Safety-Aligned Languages. CoRR abs/2505.17306 (2025). arXiv:2505.17306 doi:10.48550/ARXIV.2505.17306
- [43] Shaoyang Xu, Yongqi Leng, Linhao Yu, and Deyi Xiong. 2025. Self-Pluralising Culture Alignment for Large Language Models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Luis Chiruzzo, Alan Ritter, and Lu Wang (Eds.). Associa...
- [44] Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, and Bill Yuchen Lin. 2025. Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net. https://openreview.net/forum?id...
- [45] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jian Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Liangha... 2025. arXiv:2505.09388. doi:10.48550/arXiv.2505.09388
- [46] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X... 2024. arXiv:2412.15115. doi:10.48550/arXiv.2412.15115
- [47] Haeun Yu, Seogyeong Jeong, Siddhesh Pawar, Jisu Shin, Jiho Jin, Junho Myung, Alice Oh, and Isabelle Augenstein. 2025. Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models. CoRR abs/2508.08879 (2025). arXiv:2508.08879 doi:10.48550/ARXIV.2508.08879
- [48] Jinghao Zhang, Sihang Jiang, Shiwei Guo, Shisong Chen, Yanghua Xiao, Hongwei Feng, Jiaqing Liang, Minggui He, Shimin Tao, and Hongxia Ma. 2025. CultureScope: A Dimensional Lens for Probing Cultural Understanding in LLMs. CoRR abs/2509.16188 (2025). arXiv:2509.16188 doi:10.48550/ARXIV.2509.16188
- [49] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, and Yongqiang Ma. 2024. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. CoRR abs/2403.13372 (2024). arXiv:2403.13372 doi:10.48550/ARXIV.2403.13372