Toward LLMs Beyond English-Centric Development

Sho Takase; Ukyo Honda

arxiv: 2605.15613 · v1 · pith:MPA3FYP6new · submitted 2026-05-15 · 💻 cs.CL

Pith reviewed 2026-05-20 19:47 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLMs, English bias, continual pre-training, cultural understanding, multilingual models, training from scratch, language adaptation, model bias

The pith

LLMs are heavily biased toward English and continual pre-training offers no cost advantage over training from scratch for cultural understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes sequences generated by open-weight large language models and finds they are heavily biased toward English. It then compares continual pre-training on target language data against training entirely from scratch and concludes the former provides no cost savings, even when the goal is better cultural understanding in non-English languages. This challenges the common strategy of adapting English-dominant models and instead points to the need for dedicated per-language investments. A sympathetic reader would care because the result affects whether current approaches can deliver equitable AI performance across languages without major new resource commitments.

Core claim

Through an analysis of sequences generated by open-weight large language models, we demonstrate that LLMs are heavily biased toward English. While continual pre-training is commonly used to adapt LLMs to a target language, we show that it does not offer a cost advantage over training from scratch, even for improving cultural understanding in the target language. These findings suggest that dedicated per-language investment may become increasingly important for future LLM development, rather than relying primarily on the expansion of English-centric resources.

What carries the argument

Analysis of generated sequences to detect English bias combined with direct cost and performance comparisons between continual pre-training and from-scratch training for target-language adaptation.
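
The first half of that machinery, estimating a language distribution from sampled model outputs, can be sketched as a simple tally. This is a minimal illustration and not the authors' pipeline: `detect_language` is a hypothetical script-based stand-in for a real language identifier, the labels it returns are coarse scripts rather than languages, and the sample strings are made up.

```python
from collections import Counter


def detect_language(text: str) -> str:
    """Toy stand-in for a real language identifier (hypothetical heuristic).

    Labels a sequence by the script that most of its letters belong to.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, and punctuation
        code = ord(ch)
        if 0x3040 <= code <= 0x30FF:    # hiragana and katakana
            counts["ja"] += 1
        elif 0x4E00 <= code <= 0x9FFF:  # CJK unified ideographs
            counts["cjk"] += 1
        elif code < 0x0250:             # basic and extended Latin
            counts["latin"] += 1
        else:
            counts["other"] += 1
    if not counts:
        return "unknown"
    return counts.most_common(1)[0][0]


def language_shares(sequences):
    """Return the fraction of sequences assigned to each label."""
    labels = Counter(detect_language(s) for s in sequences)
    total = sum(labels.values())
    return {lang: n / total for lang, n in labels.items()}


# Illustrative samples only; a real analysis would use many sequences
# sampled from the model under study.
samples = [
    "The quick brown fox jumps over the lazy dog.",
    "Large language models are trained on web text.",
    "これは日本語の文です。",
    "Training cost is measured in FLOPs.",
]
shares = language_shares(samples)
print(shares)
```

A script-level label cannot separate English from other Latin-script languages, so a real study would swap in a trained language identifier; the tallying step stays the same.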

If this is right

Open-weight LLMs produce sequences that disproportionately favor English content.
Continual pre-training on target-language data shows no cost advantage for improving cultural understanding.
Future LLM development may require dedicated per-language investments rather than English-centric resource expansion.
Cultural adaptation in non-English languages demands resources comparable to initial model training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This suggests that simply scaling English data further will not efficiently close performance gaps in other languages.
Developers could explore building separate models for major languages from the start instead of adaptation pipelines.
The results raise questions about the long-term feasibility of truly multilingual models without language-specific data investments.
It connects to broader efforts in efficient data collection for underrepresented languages.

Load-bearing premise

Two things must hold: the sequences generated by the models accurately reflect underlying training biases, and the cost and performance comparisons between continual pre-training and from-scratch training are measured under comparable conditions that generalize beyond the tested models and languages.

What would settle it

A continually pre-trained model achieving substantially better cultural understanding in a target language at lower total compute cost than a from-scratch model trained on equivalent data would falsify the central claim.
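
One way to make this test concrete is to compare the two regimes at equal cost and flag any continual pre-training run that beats the from-scratch curve by a clear margin. The sketch below is an assumption-laden illustration: the function names, the 0.05 margin, the choice to interpolate linearly in log cost, and all numbers are hypothetical and do not come from the paper.

```python
import math


def interp_score(curve, cost):
    """Interpolate a score at `cost`, linearly in log10(cost).

    `curve` is a list of (cost, score) pairs. Returns None when `cost`
    lies outside the range covered by the curve.
    """
    pts = sorted(curve)
    if cost < pts[0][0] or cost > pts[-1][0]:
        return None
    for (c0, s0), (c1, s1) in zip(pts, pts[1:]):
        if c0 <= cost <= c1:
            if c0 == c1:
                return max(s0, s1)
            t = (math.log10(cost) - math.log10(c0)) / (
                math.log10(c1) - math.log10(c0)
            )
            return s0 + t * (s1 - s0)
    return pts[0][1]  # single-point curve queried at its own cost


def falsifying_points(continual, scratch, margin=0.05):
    """List continual runs that beat the from-scratch curve by more than `margin`."""
    hits = []
    for cost, score in continual:
        baseline = interp_score(scratch, cost)
        if baseline is not None and score - baseline > margin:
            hits.append((cost, score, baseline))
    return hits


# Made-up (cost in FLOPs, benchmark score) pairs for illustration only.
scratch = [(1e19, 0.30), (1e20, 0.40), (1e21, 0.50)]
continual = [(1e19, 0.31), (1e20, 0.41), (1e21, 0.49)]
print(falsifying_points(continual, scratch))        # no run clears the margin
print(falsifying_points([(1e19, 0.45)], scratch))   # one run clears the margin
```

Which cost goes on the x-axis (total compute or target-language compute only) is itself a modeling choice, and the verdict can change with it.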

Figures

Figures reproduced from arXiv: 2605.15613 by Sho Takase, Ukyo Honda.

**Figure 1.** Estimated language distribution of pre-training data for each LLM based on randomly generated sequences.

**Figure 2.** Averaged benchmark scores for each training cost. The vertical arrows in (b) indicate the performance ...

Original abstract

Through an analysis of sequences generated by open-weight large language models (LLMs), we demonstrate that LLMs are heavily biased toward English. While continual pre-training is commonly used to adapt LLMs to a target language, we show that it does not offer a cost advantage over training from scratch, even for improving cultural understanding in the target language. These findings suggest that dedicated per-language investment may become increasingly important for future LLM development, rather than relying primarily on the expansion of English-centric resources.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows clear English bias in open LLMs via generated sequences and claims continual pre-training brings no cost edge over from-scratch training for cultural understanding, but the evidence details are thin.

The core finding is that current open-weight LLMs lean heavily toward English in their outputs, and that continual pre-training does not reduce the compute needed to build cultural understanding in a target language compared with training a model from scratch on that language. This lines up with the abstract's direct comparison and points toward needing separate per-language training runs rather than just scaling English data further.

The work is useful for flagging a practical issue in resource allocation for non-English capabilities. The quantified cost comparison adds a specific data point that prior bias discussions have not always included. The sequence analysis also gives a straightforward way to surface the English tilt without heavy new machinery.

That said, the abstract gives no numbers on sample sizes, exact metrics for cultural understanding, model choices, or any statistical checks, so it is difficult to judge whether the performance differences are reliable or just noise from small tests. The stress-test point about unmatched total target-language tokens and base-model scale looks like it could matter here; if the from-scratch runs saw less relevant data overall, the cost numbers would not be directly comparable. Without the full methods section it is hard to tell whether those controls were applied.

The paper is aimed at groups building or evaluating multilingual models and at people tracking fairness in LLM development. Readers who want concrete numbers on training trade-offs will get something from it, even if the current write-up stays high-level. I would send it to peer review so the authors can supply the missing experimental details and let referees check the cost accounting.

Referee Report

2 major / 1 minor

Summary. The manuscript analyzes sequences generated by open-weight LLMs to demonstrate heavy English-centric bias. It compares continual pre-training against training from scratch for target-language adaptation and reports that the former provides no cost advantage even for cultural understanding tasks, concluding that dedicated per-language resources will be increasingly necessary rather than relying on English-centric expansion.

Significance. If the empirical comparisons hold under matched conditions, the result would directly inform multilingual LLM development practices by questioning the efficiency of continual pre-training for non-English cultural alignment. The direct use of generated-sequence analysis and training-cost comparisons (rather than parameter fitting) provides a falsifiable empirical basis, though the lack of reported controls limits immediate generalizability.

major comments (2)

[Cost and Performance Comparison] The performance-per-compute comparison between continual pre-training and from-scratch training (central to the cost-advantage claim) does not specify whether total target-language tokens seen or base-model scale are matched across regimes. Without such controls, observed differences could arise from unequal exposure rather than inherent cost properties of the two regimes.
[Methodology / Experimental Setup] No details are provided on sample sizes, exact metrics for cultural understanding, model selection criteria, or statistical controls used in the sequence-generation analysis or cost calculations. This absence prevents verification that the data support the stated findings on English bias and cost equivalence.
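
The matching concern in the first comment can be expressed as a small check on run metadata. The sketch below is illustrative only: the `Run` record, the 5% tolerance, and the example numbers are hypothetical, and the cost estimate uses the common approximation of about 6 FLOPs per parameter per training token, which may differ from the paper's own cost accounting.

```python
from dataclasses import dataclass


@dataclass
class Run:
    name: str
    params: float         # number of model parameters, N
    target_tokens: float  # target-language tokens seen in this regime, D

    @property
    def flops(self) -> float:
        # Rough training cost, C ~ 6 * N * D. For continual pre-training
        # this counts only the adaptation phase, not the base model's cost.
        return 6.0 * self.params * self.target_tokens


def is_matched(a: Run, b: Run, rel_tol: float = 0.05) -> bool:
    """True when model scale and target-language exposure agree within rel_tol."""
    def close(x: float, y: float) -> bool:
        return abs(x - y) <= rel_tol * max(x, y)

    return close(a.params, b.params) and close(a.target_tokens, b.target_tokens)


# Made-up runs for illustration only.
scratch = Run("from-scratch", params=1e9, target_tokens=2e10)
cpt = Run("continual", params=1e9, target_tokens=2e10)
small = Run("from-scratch-small", params=1e9, target_tokens=1e10)

print(is_matched(scratch, cpt))    # same scale and same token exposure
print(is_matched(scratch, small))  # half the target-language tokens
print(f"{scratch.flops:.2e}")
```

Reporting a table of such records for every run would let a reader confirm that a comparison is between matched regimes before reading the performance numbers.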

minor comments (1)

[Abstract] The abstract would be strengthened by naming the specific open-weight models and target languages examined.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of experimental controls and methodological transparency in our analysis of English-centric biases in open-weight LLMs and the relative costs of continual pre-training versus from-scratch training. We address each major comment below and will revise the manuscript to incorporate the requested clarifications.

Referee: [Cost and Performance Comparison] The performance-per-compute comparison between continual pre-training and from-scratch training (central to the cost-advantage claim) does not specify whether total target-language tokens seen or base-model scale are matched across regimes. Without such controls, observed differences could arise from unequal exposure rather than inherent cost properties of the two regimes.

Authors: We agree that explicit confirmation of matched conditions is necessary to support the cost-advantage claim. Our experiments were conducted with equivalent total target-language token exposure across regimes and used base models of comparable scale; however, these details were not stated with sufficient precision in the original submission. In the revision we will add a dedicated subsection describing the exact token counts, model scales, and matching procedure for both the continual pre-training and from-scratch settings, thereby allowing readers to verify that differences are attributable to the training regime rather than unequal exposure.

revision: yes

Referee: [Methodology / Experimental Setup] No details are provided on sample sizes, exact metrics for cultural understanding, model selection criteria, or statistical controls used in the sequence-generation analysis or cost calculations. This absence prevents verification that the data support the stated findings on English bias and cost equivalence.

Authors: We acknowledge that the current manuscript lacks the level of methodological detail required for full reproducibility and verification. The revised version will expand the experimental setup to report sample sizes for the generated-sequence analysis, the precise metrics and evaluation protocols used to measure cultural understanding, the criteria applied when selecting the open-weight models, and any statistical controls or significance testing employed in the bias and cost comparisons. These additions will directly enable independent assessment of the reported English bias and cost-equivalence results.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observations and cost comparisons stand independently

The paper's central claims rest on direct analysis of sequences generated by open-weight LLMs and explicit training-cost comparisons between continual pre-training and from-scratch regimes. No equations, fitted parameters, or self-citations are used to define the target quantities or to force the reported outcomes by construction. The work is self-contained against external model outputs and compute measurements; any cited prior results function as background rather than load-bearing premises that collapse the argument.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on empirical observations of generated sequences and cost comparisons; no free parameters, ad-hoc axioms, or invented entities are introduced in the abstract.

axioms (1)

domain assumption Generated sequences from open-weight LLMs reliably indicate training data biases.
Invoked to interpret the analysis results as evidence of English bias.

pith-pipeline@v0.9.0 · 5592 in / 1127 out tokens · 40081 ms · 2026-05-20T19:47:20.903113+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

performance on cultural understanding related to a specific language is proportional to the logarithmic scale of the training cost for that language, rather than to the total training cost
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

the cost-performance trend of continual pre-training follows approximately the same scaling as training from scratch for the target language
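
Read together, the two quoted passages can be written as one relation. The notation below is introduced here for illustration and is not taken from the paper; in particular, the intercept is an added assumption, and strict proportionality corresponds to setting it to zero.

```latex
% S_\ell : cultural-understanding score for target language \ell
% C_\ell : training cost spent on language \ell (not the total training cost)
% a_\ell, b_\ell : per-language constants (hypothetical notation)
S_\ell \approx a_\ell + b_\ell \log C_\ell

% Second passage: continual pre-training (CPT) follows approximately the
% same trend as training from scratch when both are indexed by C_\ell.
S_\ell^{\mathrm{CPT}}(C_\ell) \approx S_\ell^{\mathrm{scratch}}(C_\ell)
```

Under this reading, compute spent on other languages does not move the score for language \ell along the curve, which is how the passage supports the no-cost-advantage claim.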

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, and 29 others. 2023. https://arxiv.org/abs/2309.16609 Qwen technical report

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, and 12 others. 2020. Language models are few-shot learners. I...

work page 2020

[3] [3]

Nicholas Carlini, Florian Tram \`e r, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, \'U lfar Erlingsson, Alina Oprea, and Colin Raffel. 2021. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633--2650

work page 2021

[4] [4]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. https://arxiv.org/abs/2110.14168 Training verifiers to solve math word problems

work page internal anchor Pith review Pith/arXiv arXiv 2021

[5] [5]

Yiming Cui, Ziqing Yang, and Xin Yao. 2024. https://arxiv.org/abs/2304.08177 Efficient and effective text encoding for chinese llama and alpaca

work page arXiv 2024

[6] [6]

Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, and Naoaki Okazaki. 2024. Continual pre-training for cross-lingual LLM adaptation: Enhancing japanese language capabilities. In First Conference on Language Modeling (COLM)

work page 2024

[7] [7]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. https://arxiv.org/abs/2407.21783 The llama 3...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC)

work page 2018

[9] [9]

Dirk Groeneveld, Iz Beltagy, Evan Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, and 24 others. 2024. OLM o: Accelerating the science of language models. In Proceedin...

work page 2024

[10] [10]

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Kar\' e n Simonyan, Erich Elsen, and 3 others. 2022. An empirical a...

work page 2022

[11] [11]

Jafar Isbarov, Arofat Akhundjanova, Mammad Hajili, Kavsar Huseynova, Dmitry Gaynullin, Anar Rzayev, Osman Tursun, Aizirek Turdubaeva, Ilshat Saetov, Rinat Kharisov, Saule Belginova, Ariana Kenbayeva, Amina Alisheva, Abdullatif K \"o ksal, Samir Rustamov, and Duygu Ataman. 2025. TUMLU : A unified and native language understanding benchmark for T urkic lang...

[12]

Maor Ivgi, Yair Carmon, and Jonathan Berant. 2022. Scaling laws under the microscope: Predicting transformer performance from small scale experiments. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 7354--7371

[13]

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. https://arxiv.org/abs/1612.03651 FastText.zip: Compressing text classification models

[14]

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models

[15]

Maurice Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1/2):81--93

[16]

Fajri Koto, Nurul Aisyah, Haonan Li, and Timothy Baldwin. 2023. Large language models only pass primary school exams in Indonesia: A comprehensive test on IndoMMLU. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 12359--12374

[17]

Fajri Koto, Haonan Li, Sara Shatnawi, Jad Doughman, Abdelrahman Sadallah, Aisha Alraeesi, Khalid Almubarak, Zaid Alyafeai, Neha Sengupta, Shady Shehata, Nizar Habash, Preslav Nakov, and Timothy Baldwin. 2024. ArabicMMLU: Assessing massive multitask language understanding in Arabic. In Findings of the Association for Computational Linguistics: ACL 2024...

[18]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

[19]

Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. 2024. CMMLU: Measuring massive multitask language understanding in Chinese. In Findings of the Association for Computational Linguistics: ACL 2024, pages 11260--11285

[20]

Seng Pei Liew, Takuya Kato, and Sho Takase. 2025. Scaling laws for upcycling mixture-of-experts language models. In Forty-second International Conference on Machine Learning (ICML)

[21]

LLM-jp. 2024. https://arxiv.org/abs/2407.03963 LLM-jp: A cross-organizational project for the research and development of fully open Japanese LLMs

[22]

David Orlando Romero Mogrovejo, Chenyang Lyu, Haryo Akbarianto Wibowo, Santiago Góngora, Aishik Mandal, Sukannya Purkayastha, Jesus-German Ortiz-Barajas, Emilio Villa Cueva, Jinheon Baek, Soyeong Jeong, Injy Hamed, Zheng Xin Yong, Zheng Wei Lim, Paula Mónica Silva, Jocelyn Dunstan, Mélanie Jouitteau, David LE MEUR, Joan Nwatu, Ganzorig Batnasa...

[23]

Milad Nasr, Javier Rando, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Florian Tramèr, and Katherine Lee. 2025. Scalable extraction of training data from aligned, production language models. In The Thirteenth International Conference on Learning Representations (ICLR)

[24]

Ramon Pires, Hugo Abonizio, Thales Sales Almeida, and Rodrigo Nogueira. 2023. Sabiá: Portuguese large language models. In Intelligent Systems, pages 226--240

[25]

Satoshi Sekine. 2003. Development of a question answering system focused on an encyclopedia (in Japanese). In 9th Annual Meeting of the Association for Natural Language Processing

[26]

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR)

[27]

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2023. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations (ICLR)

[28]

Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David Ifeoluwa Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Sebastian Ruder, Wei-Yin Ko, Antoine Bosselut, Alice Oh, Andre Martins, Leshem Choshen, Daphne Ippolito, and 4 others. 2025. Global MMLU ...

[29]

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, and 17 others. 2024. Dolma: an open corpus of three trillion tokens for lan...

[30]

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and 197 others. 2025. https://arxiv.org/abs/2503.19786...

[31]

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, and 89 others. 2024. https://arxiv.org/abs/2403.0829...

[32]

Mukhammed Togmanov, Nurdaulet Mukhituly, Diana Turmakhan, Jonibek Mansurov, Maiya Goloburda, Akhmed Sakip, Zhuohan Xie, Yuxia Wang, Bekassyl Syzdykov, Nurkhan Laiyk, Alham Fikri Aji, Ekaterina Kochmar, Preslav Nakov, and Fajri Koto. 2025. KazMMLU: Evaluating language models on Kazakh, Russian, and regional knowledge of Kazakhstan. In Proceedings of ...

[33]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. https://arxiv.org/abs/2302.13971 Llama: Open and efficient foundation language models

[34]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, and 49 others. 2023b. https://arxiv.org/abs/2307.09288 Llama 2: Open...

[35]

Ashmal Vayani, Dinura Dissanayake, Hasindri Watawana, Noor Ahsan, Nevasini Sasikumar, Omkar Thawakar, Henok Biadglign Ademtew, Yahya Hmaiti, Amandeep Kumar, Kartik Kuckreja, Mykola Maslych, Wafa Al Ghallabi, Mihail Mihaylov, Chao Qin, Abdelrahman M Shaker, Mike Zhang, Mahardika Krisna Ihsani, Amiel Esplana, Monil Gokani, and 50 others. 2024. https://arxiv...

[36]

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. 2024. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In Advances in Neural Information Processing Systems (N...

[37]

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, and Bill Yuchen Lin. 2025. Magpie: Alignment data synthesis from scratch by prompting aligned LLMs with nothing. In The Thirteenth International Conference on Learning Representations (ICLR)

[38]

Weihao Xuan, Rui Yang, Heli Qi, Qingcheng Zeng, Yunze Xiao, Aosong Feng, Dairui Liu, Yun Xing, Junjue Wang, Fan Gao, Jinghui Lu, Yuang Jiang, Huitao Li, Xin Li, Kunyu Yu, Ruihai Dong, Shangding Gu, Yuekang Li, Xiaofei Xie, and 13 others. 2025. MMLU-ProX: A multilingual benchmark for advanced large language model evaluation. In Proceedings of the 2025...

[39]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 23 others. 2025. https://arxiv.org/abs/2412.15115 Qwen2.5 technical report

[40]

Wen Yang, Chong Li, Jiajun Zhang, and Chengqing Zong. 2023. https://arxiv.org/abs/2305.18098 BigTranslate: Augmenting large language models with multilingual translation capability over 100 languages

[41]

Ziqi Yin, Hao Wang, Kaito Horio, Daisuke Kawahara, and Satoshi Sekine. 2024. Should we respect LLMs? A cross-lingual study on the influence of prompt politeness on LLM performance. In Proceedings of the Second Workshop on Social Influence in Conversations (SICon 2024), pages 9--35

[42]

Jun Zhao, Zhihao Zhang, Luhui Gao, Qi Zhang, Tao Gui, and Xuanjing Huang. 2024. https://arxiv.org/abs/2401.01055 Llama beyond English: An empirical study on language capability transfer

[43]

Chengzhi Zhong, Qianying Liu, Fei Cheng, Junfeng Jiang, Zhen Wan, Chenhui Chu, Yugo Murawaki, and Sadao Kurohashi. 2025. What language do non-English-centric large language models think in? In Findings of the Association for Computational Linguistics: ACL 2025, pages 26333--26346
