C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment
Pith reviewed 2026-05-10 09:45 UTC · model grok-4.3
The pith
Geometric misalignment across languages in embedding spaces serves as a signal to automatically mine high-fidelity cultural seeds from raw text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cultural concepts display measurable cross-lingual misalignment inside pre-trained embedding spaces, which can be used to identify and extract high-fidelity Culture Points from raw multilingual corpora without human or LLM supervision. These points then guide the synthesis of diverse instruction-tuning data, yielding improved cultural understanding and reasoning in downstream models.
What carries the argument
Cross-lingual geometric misalignment in pre-trained embedding spaces, used as the discovery signal to locate regions of linguistic exclusivity and isolation.
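Read mechanically, the signal is a distance test: a concept that translates cleanly should land near its translations in a shared multilingual embedding space, while a culture-bound term should not. A minimal sketch of such a score, assuming a generic multilingual encoder — the toy vectors below stand in for real embeddings, and the exact metric, thresholds, and model are not specified in the abstract:

```python
import numpy as np

def cosine(u, v):
    # Standard cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def misalignment(term_vec, translation_vecs):
    """Illustrative signal: 1 minus the mean cosine similarity between a
    term's embedding and the embeddings of its translations. Higher values
    mean the concept occupies different regions across languages."""
    sims = [cosine(term_vec, t) for t in translation_vecs]
    return 1.0 - float(np.mean(sims))

# Toy vectors standing in for a multilingual encoder's output.
# A universal concept ("bread") aligns closely with its translations ...
bread_en = np.array([0.9, 0.1, 0.0])
bread_fr = np.array([0.88, 0.12, 0.01])
bread_de = np.array([0.91, 0.09, 0.02])
# ... while a culture-bound term drifts toward a language-specific region.
hygge_da = np.array([0.2, 0.3, 0.9])
hygge_en = np.array([0.7, 0.4, 0.2])  # loose English paraphrase

print(misalignment(bread_en, [bread_fr, bread_de]))  # close to 0
print(misalignment(hygge_da, [hygge_en]))            # clearly larger
```

Under this reading, mining reduces to ranking corpus terms by such a score and keeping the isolated tail — which is exactly why the validity of the proxy, not the arithmetic, is the load-bearing question.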
If this is right
- Cultural data preparation costs fall more than 150-fold relative to manual or LLM-assisted curation.
- Models using the synthesized data gain over 6 points on hard cultural reasoning benchmarks.
- The method exceeds prior supervised baselines in cultural alignment performance.
- Seed discovery becomes a fully automatic, quantifiable process that works on any raw multilingual corpus.
Where Pith is reading between the lines
- Similar misalignment metrics could be tested on non-cultural domains such as technical terminology to locate domain-specific seeds automatically.
- Controlled experiments on corpora with deliberately balanced cultural coverage would clarify whether isolation reflects genuine exclusivity or data artifacts.
- Inserting the mining step upstream in existing synthesis pipelines could lower dependence on curated or proprietary sources for alignment data.
Load-bearing premise
Pronounced linguistic exclusivity and geometric isolation in embedding spaces reliably mark culturally specific concepts rather than artifacts of training data imbalance or embedding model biases.
What would settle it
Applying the extraction process to a multilingual corpus of neutral, non-cultural topics and finding that it still selects large numbers of items as Culture Points, or observing no gain on cultural benchmarks when the resulting seeds are used for data synthesis.
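That settling criterion can be framed as a simple measurement: run the same extractor on a deliberately neutral corpus and compare its hit rate against the cultural corpus. The harness below is hypothetical — the 5x margin and the decision labels are illustrative choices, not anything the paper proposes:

```python
def falsification_check(rate_cultural, rate_neutral, margin=5.0):
    """Sketch of the control experiment: if the miner flags 'Culture
    Points' in a neutral corpus at a rate comparable to a cultural one,
    the geometric signal likely tracks corpus artifacts rather than
    cultural specificity. `margin` is how many times higher the cultural
    hit rate must be for the premise to survive (an assumed threshold)."""
    if rate_neutral == 0:
        return "signal survives"
    ratio = rate_cultural / rate_neutral
    return "signal survives" if ratio >= margin else "signal suspect"

# e.g. the miner flags 12% of terms in a cultural corpus but 9% in a
# balanced technical one: far below a 5x margin, so the premise fails.
print(falsification_check(0.12, 0.09))  # → signal suspect
```

The second arm of the test — no benchmark gain when the mined seeds are used for synthesis — would require the full training pipeline and is not sketchable this compactly.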
Original abstract
Achieving cultural alignment in Large Language Models (LLMs) increasingly depends on synthetic data generation. For such synthesis, the most vital initial step is seed curation; however, current methods lack quantifiable standards for selecting these seeds. Existing approaches rely on unscalable manual curation or bias-prone LLM extraction, treating cultural specificity as an abstract concept rather than a measurable signal. In this paper, we address this "quantification gap" by proposing C-Mining, an unsupervised framework that transforms the discovery of cultural seeds from a subjective selection process into a computable data mining formulation. Our approach exploits a novel geometric insight, leveraging the cross-lingual misalignment of cultural concepts within pre-trained embedding spaces as a quantifiable discovery signal. By systematically identifying these regions characterized by pronounced linguistic exclusivity and geometric isolation, while actively filtering out noise, C-Mining automatically extracts high-fidelity Culture Points (CPs) from raw multilingual corpora without reliance on human or LLM supervision, reducing preparation costs by more than 150-fold. We further leverage the mined knowledge to steer the synthesis of diverse instruction-tuning datasets. Extensive experiments demonstrate that this seed-centric approach significantly enhances cultural understanding and reasoning capabilities, achieving a +6.03 point improvement on CulturalBench-Hard and surpassing state-of-the-art baselines, providing a scalable, quantifiable solution for high-quality cultural data synthesis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces C-Mining, an unsupervised framework for discovering cultural seeds termed Culture Points (CPs) from raw multilingual corpora. It exploits cross-lingual geometric misalignment and linguistic exclusivity in pre-trained embedding spaces as a discovery signal, applies noise filtering, and uses the resulting CPs to steer synthetic instruction-tuning data generation. The paper claims this eliminates human or LLM supervision in seed curation, reduces preparation costs by more than 150-fold, and delivers a +6.03 point gain on CulturalBench-Hard while surpassing state-of-the-art baselines.
Significance. If the geometric signal can be shown to identify culturally specific concepts rather than embedding artifacts, the work would offer a scalable, objective alternative to manual or LLM-assisted seed curation for cultural alignment in LLMs. This could meaningfully lower barriers to high-quality synthetic cultural data and encourage similar unsupervised mining approaches in related alignment tasks.
major comments (2)
- [Abstract] Abstract: The central claim that geometric misalignment and linguistic exclusivity reliably mark 'high-fidelity' cultural concepts is load-bearing, yet the abstract provides no external validation (human annotation of extracted CPs, comparison against frequency-matched non-cultural terms, or controlled ablation of embedding biases). Without such grounding, the method risks identifying corpus imbalance artifacts rather than cultural specificity, undermining the unsupervised discovery premise.
- [Abstract] Abstract: The reported +6.03 point improvement on CulturalBench-Hard and 150-fold cost reduction are presented without specifying the baseline synthesis pipelines, the exact filtering steps applied after geometric identification, or the composition and difficulty distribution of CulturalBench-Hard. These omissions prevent assessment of whether gains are attributable to the mined CPs or to other uncontrolled factors in the data synthesis stage.
minor comments (1)
- [Abstract] Abstract: The acronym 'CPs' for Culture Points is introduced without a concise illustrative example of what constitutes a mined point; adding one sentence with a concrete multilingual example would improve immediate clarity for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment point by point below, providing clarifications based on the manuscript content and indicating planned revisions to improve the abstract's completeness.
Point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that geometric misalignment and linguistic exclusivity reliably mark 'high-fidelity' cultural concepts is load-bearing, yet the abstract provides no external validation (human annotation of extracted CPs, comparison against frequency-matched non-cultural terms, or controlled ablation of embedding biases). Without such grounding, the method risks identifying corpus imbalance artifacts rather than cultural specificity, undermining the unsupervised discovery premise.
Authors: We acknowledge that the abstract, due to its brevity, does not explicitly reference the validation experiments. The manuscript grounds the claim through human annotation of sampled CPs for cultural fidelity (Section 4.2), direct comparisons to frequency-matched non-cultural terms showing the signal exceeds frequency effects (Section 5.1), and ablations isolating embedding biases (Section 5.3). These demonstrate the geometric misalignment identifies culturally specific concepts rather than artifacts. We will revise the abstract to include a concise reference to these validations. revision: yes
-
Referee: [Abstract] Abstract: The reported +6.03 point improvement on CulturalBench-Hard and 150-fold cost reduction are presented without specifying the baseline synthesis pipelines, the exact filtering steps applied after geometric identification, or the composition and difficulty distribution of CulturalBench-Hard. These omissions prevent assessment of whether gains are attributable to the mined CPs or to other uncontrolled factors in the data synthesis stage.
Authors: We agree the abstract would benefit from greater specificity on these elements. The baselines are standard LLM instruction synthesis pipelines without cultural seeds (detailed in Section 6 and Table 3). Filtering steps consist of linguistic exclusivity thresholding followed by geometric outlier removal (Section 3.4). CulturalBench-Hard comprises 1,200 hard cultural reasoning items across 12 domains, with difficulty distribution reported in Section 4.1. We will update the abstract to concisely specify the baselines, filtering, and benchmark details. revision: yes
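The two filtering stages named in this response can be sketched as a small pipeline. Everything below is a schematic reconstruction from the rebuttal's one-line description: the exclusivity score, the k-nearest-neighbour isolation measure, and both thresholds (`tau_excl`, `tau_iso`) are illustrative placeholders, not the paper's actual formulation or parameters.

```python
import numpy as np

def exclusivity(lang_counts):
    """Fraction of a term's occurrences concentrated in its single most
    frequent language (placeholder for a linguistic-exclusivity score)."""
    total = sum(lang_counts.values())
    return max(lang_counts.values()) / total

def knn_isolation(vec, neighbors, k=3):
    """Mean distance to the k nearest neighbors: large values suggest
    geometric isolation; near-zero values suggest duplicated noise."""
    dists = sorted(np.linalg.norm(vec - n) for n in neighbors)
    return float(np.mean(dists[:k]))

def mine_candidates(terms, tau_excl=0.8, tau_iso=(0.05, 5.0)):
    """Two-stage filter: keep terms that are (1) linguistically exclusive
    and (2) geometrically isolated, but not so far out as to be noise."""
    kept = []
    for term, lang_counts, vec, neighbors in terms:
        if exclusivity(lang_counts) < tau_excl:
            continue  # stage 1: linguistic exclusivity thresholding
        iso = knn_isolation(vec, neighbors)
        if tau_iso[0] <= iso <= tau_iso[1]:  # stage 2: outlier band
            kept.append(term)
    return kept

# Toy corpus: a diagonal cloud of shared concepts plus two probe terms.
cloud = [np.full(4, 0.1 * i) for i in range(20)]
terms = [
    ("hanami", {"ja": 95, "en": 5}, np.full(4, 2.5), cloud),        # exclusive + isolated
    ("internet", {"en": 40, "fr": 35, "ja": 25}, np.zeros(4), cloud),  # shared concept
]
print(mine_candidates(terms))  # → ['hanami']
```

The upper bound of the isolation band is what implements the "actively filtering out noise" clause: a point infinitely far from everything is more likely an encoding artifact than a concept.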
Circularity Check
C-Mining defines Culture Points via the misalignment property presupposed for cultural concepts
specific steps (1)
- self-definitional [Abstract]: "leveraging the cross-lingual misalignment of cultural concepts within pre-trained embedding spaces as a quantifiable discovery signal. By systematically identifying these regions characterized by pronounced linguistic exclusivity and geometric isolation, while actively filtering out noise, C-Mining automatically extracts high-fidelity Culture Points (CPs) from raw multilingual corpora without reliance on human or LLM supervision" — The text assumes cultural concepts are defined by their misalignment and isolation, then defines the mining procedure as locating exactly those properties and labeling the results Culture Points. The output is therefore equivalent to the input assumption by construction, with no independent test of whether the geometric signal corresponds to cultural specificity outside the embedding space.
full rationale
The paper's central derivation begins by positing that cultural concepts exhibit cross-lingual misalignment and geometric isolation in embedding spaces, then operationalizes discovery as the identification of precisely those regions. This reduces the unsupervised extraction step to a restatement of the initial assumption rather than an independent criterion. The subsequent filtering and synthesis steps inherit this equivalence, and no external grounding (such as human-validated cultural labels independent of the embedding geometry) is invoked to break the loop. While benchmark gains are reported, they do not validate the seed-identification premise itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pre-trained multilingual embeddings preserve cross-lingual misalignment as a reliable proxy for cultural specificity.
invented entities (1)
- Culture Points (CPs) — no independent evidence
Reference graph
Works this paper leans on
- [1] Yu Ying Chiu, Liwei Jiang, Bill Yuchen Lin, Chan Young Park, Shuyue Stella Li, Sahithya Ravi, Mehar Bhatia, Maria Antoniak, Yulia Tsvetkov, Vered Shwartz, and Yejin Choi. 2025. CulturalBench: A Robust, Diverse and Challenging Benchmark for Measuring LMs' Cultural Knowledge Through Human-AI Red-Teaming. In Proceedings of the 63rd Annual Meeting of the Asso...
- [2] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-...
- [3] Li Du, Hanyu Zhao, Yiming Ju, and Tengfei Pan. 2025. Scaling Towards the Information Boundary of Instruction Set: InfinityInstruct-Subject Technical Report. CoRR abs/2507.06968 (2025). arXiv:2507.06968 doi:10.48550/ARXIV.2507.06968
- [4] Esin Durmus, Karina Nyugen, Thomas I. Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, Liane Lovitt, Sam McCandlish, Orowa Sikder, Alex Tamkin, Janel Thamkul, Jared Kaplan, Jack Clark, and Deep Ganguli. 2023. Towards Measuring the Representation of Subjective Global Opinions in Lang...
- [5] Julen Etxaniz, Gorka Azkune, Aitor Soroa, Oier Lopez de Lacalle, and Mikel Artetxe. 2024. Do Multilingual Language Models Think Better in English?. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Short Papers, NAACL 2024, Mexico City, Mexico, June 16-21, 2...
- [6] EVS/WVS. 2024. Joint EVS/WVS 2017-2022 Dataset (Joint EVS/WVS). (ZA7505; Version 5.0.0) [Data set]. GESIS, Cologne. doi:10.4232/1.14320
- [7] Ruixiang Feng, Shen Gao, Xiuying Chen, Lisi Chen, and Shuo Shang. 2025. CulFiT: A Fine-grained Cultural-aware LLM Training Paradigm via Multilingual Critique Data Synthesis. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, Wanxiang Che, J...
- [8] Wikimedia Foundation. 2021. Wikipedia dump. https://dumps.wikimedia.org/
- [9] Yi Fung, Ruining Zhao, Jae Doo, Chenkai Sun, and Heng Ji. 2024. Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking. CoRR abs/2402.09369 (2024). arXiv:2402.09369 doi:10.48550/ARXIV.2402.09369
- [10] Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. 2024. Scaling Synthetic Data Creation with 1,000,000,000 Personas. CoRR abs/2406.20094 (2024). arXiv:2406.20094 doi:10.48550/ARXIV.2406.20094
- [11] Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. 2024. ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools. arXiv preprint arXiv:2406.12793 (2024). doi:10.48550/arXiv.2406.12793
- [12] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783 (2024). doi:10.48550/arXiv.2407.21783
- [14] Yanzhu Guo, Simone Conia, Zelin Zhou, Min Li, Saloni Potdar, and Henry Xiao. 2025. Do Large Language Models have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (...
- [16] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, Adriane Boyd, et al. spaCy: Industrial-strength natural language processing in Python. Software available from https://spacy.io/
- [18] Tomás Horych, Christoph Mandl, Terry Ruas, André Greiner-Petter, Bela Gipp, Akiko Aizawa, and Timo Spinde. 2025. The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection. In Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, USA, April 29 - May 4, 2025, Luis Chiru...
- [19] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net. https://openreview.net/forum?id=nZeVKeeFYf9
- [20] Austin C. Kozlowski, Callin Dai, and Andrei Boutyline. 2025. Semantic Structure in Large Language Model Embeddings. CoRR abs/2508.10003 (2025). arXiv:2508.10003 doi:10.48550/ARXIV.2508.10003
- [22] Cheng Li, Mengzhuo Chen, Jindong Wang, Sunayana Sitaram, and Xing Xie. 2024. CultureLLM: Incorporating Cultural Differences into Large Language Models. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, Amir Globersons, Le...
- [23] Cheng Li, Damien Teney, Linyi Yang, Qingsong Wen, Xing Xie, and Jindong Wang. 2024. CulturePark: Boosting Cross-cultural Understanding in Large Language Models. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, Amir Globerson...
- [24] Zheng Wei Lim, Alham Fikri Aji, and Trevor Cohn. 2025. Language-Specific Latent Process Hinders Cross-Lingual Performance. CoRR abs/2505.13141 (2025). arXiv:2505.13141 doi:10.48550/ARXIV.2505.13141
- [25] Yihong Liu, Mingyang Wang, Amir Hossein Kargaran, Felicia Körner, Ercong Nie, Barbara Plank, François Yvon, and Hinrich Schütze. 2025. Tracing Multilingual Factual Knowledge Acquisition in Pretraining. CoRR abs/2505.14824 (2025). arXiv:2505.14824 doi:10.48550/ARXIV.2505.14824
- [26] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. https://openreview.net/forum?id=Bkg6RiCqY7
- [27] Mistral AI. 2025. Ministral-3-8B-Instruct-2512. https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512
- [28] Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M. Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2023. Crosslingual Generalization through Multitask Finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics, 15991–16111. doi:10.18653/V1/202...
- [30] Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Putri, Dimosthenis Antypas, Hsuvas Borkakoty, Eunsu Kim, Carla Perez-Almendros, Abinew Ali Ayele, et al. 2024. BLEnD: A benchmark for LLMs on everyday knowledge in diverse cultures and languages. Advances in Neural Information Processing Systems 37 (2024), 78104–78146. doi:10.52202/079017-2483
- [32] Tuan-Phong Nguyen, Simon Razniewski, Aparna S. Varde, and Gerhard Weikum. 2023. Extracting Cultural Commonsense Knowledge at Scale. In Proceedings of the ACM Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023 - 4 May 2023, Ying Ding, Jie Tang, Juan F. Sequeda, Lora Aroyo, Carlos Castillo, and Geert-Jan Houben (Eds.). ACM, 1907–1917. doi:10.1145/3543507.3583535
- [34] Haris Riaz, Sourav Sanjukta Bhabesh, Vinayak Arannil, Miguel Ballesteros, and Graham Horwood. 2025. MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation. In Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, a...
- [35] Sougata Saha, Saurabh Kumar Pandey, and Monojit Choudhury. 2025. Meta-Cultural Competence: Climbing the Right Hill of Cultural Awareness. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 - Volume 1: Long Papers, Albuquerque, New Mexico, ...
- [36] Weiyan Shi, Ryan Li, Yutong Zhang, Caleb Ziems, Sunny Yu, Raya Horesh, Rogério de Paula, and Diyi Yang. 2024. CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies. In Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 12-16, 2024, Yaser Al-Onaizan, Mohit Ba...
- [37] William Shiao and Evangelos E. Papalexakis. 2024. Synthetic data for learning-based knowledge discovery. SIGKDD Explor. Newsl. 26, 1 (July 2024), 19–23. doi:10.1145/3682112.3682115
- [38] Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David Ifeoluwa Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Sebastian Ruder, Wei-Yin Ko, Antoine Bosselut, Alice Oh, André F. T. Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, Marzieh Fadae... 2025.
- [39] Anders Søgaard, Sebastian Ruder, and Ivan Vulic. 2018. On the Limitations of Unsupervised Bilingual Dictionary Induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, Iryna Gurevych and Yusuke Miyao (Eds.). Association for Computational L...
- [40] Shashank Srivastava. 2025. Large Language Models Threaten Language's Epistemic and Communicative Foundations. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (Eds.). Association for Computational Linguistics, Suzhou, China, 28650–28...
- [41] Ivan Vulic, Edoardo Maria Ponti, Robert Litschko, Goran Glavas, and Anna Korhonen. 2020. Probing Pretrained Language Models for Lexical Semantics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Com...
- [42] Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schütze, and Barbara Plank. 2025. Refusal Direction is Universal Across Safety-Aligned Languages. CoRR abs/2505.17306 (2025). arXiv:2505.17306 doi:10.48550/ARXIV.2505.17306
- [43] Shaoyang Xu, Yongqi Leng, Linhao Yu, and Deyi Xiong. 2025. Self-Pluralising Culture Alignment for Large Language Models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Luis Chiruzzo, Alan Ritter, and Lu Wang (Eds.). Associa...
- [44] Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, and Bill Yuchen Lin. 2025. Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net. https://openreview.net/forum?id...
- [45] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jian Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Liangha... 2025. arXiv:2505.09388. doi:10.48550/arXiv.2505.09388
- [46] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X... 2024. arXiv:2412.15115. doi:10.48550/arXiv.2412.15115
- [47] Haeun Yu, Seogyeong Jeong, Siddhesh Pawar, Jisu Shin, Jiho Jin, Junho Myung, Alice Oh, and Isabelle Augenstein. 2025. Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models. CoRR abs/2508.08879 (2025). arXiv:2508.08879 doi:10.48550/ARXIV.2508.08879
- [48] Jinghao Zhang, Sihang Jiang, Shiwei Guo, Shisong Chen, Yanghua Xiao, Hongwei Feng, Jiaqing Liang, Minggui He, Shimin Tao, and Hongxia Ma. 2025. CultureScope: A Dimensional Lens for Probing Cultural Understanding in LLMs. CoRR abs/2509.16188 (2025). arXiv:2509.16188 doi:10.48550/ARXIV.2509.16188
- [49] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, and Yongqiang Ma. 2024. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. CoRR abs/2403.13372 (2024). arXiv:2403.13372 doi:10.48550/ARXIV.2403.13372