DEPART: DEcomposing PARiTy across Multilingual LLMs

Himanshu Beniwal; Manan Uppadhyay; Pranjal Chitale; Prashant Kodali; Reshma Ramaprasad; Sunayana Sitaram

arxiv: 2605.28163 · v1 · pith:RZUDR4WDnew · submitted 2026-05-27 · 💻 cs.CL · cs.AI

DEPART: DEcomposing PARiTy across Multilingual LLMs

Manan Uppadhyay , Prashant Kodali , Pranjal Chitale , Reshma Ramaprasad , Himanshu Beniwal , Sunayana Sitaram This is my paper

Pith reviewed 2026-06-29 12:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords multilingual LLMsperformance variance decompositionBayesian hierarchical modelinglanguage featuresrepresentational similarityNLU vs reasoning

0 comments

The pith

Observable language features explain 79% of variance on understanding tasks and 92% on reasoning in multilingual LLMs, with internal similarity to English as the strongest predictor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first confirms that performance gaps across languages in mLLMs are systematic rather than random using distribution-free statistical tests. It then applies a two-step Bayesian hierarchical decomposition to separate variance due to language identity from other factors. Observable traits such as script, language family, and typological distance account for most of the language-level variance, while a model's representational closeness to English stands out as the main driver. The framework also shows that model identity explains the bulk of variance in understanding tasks, but benchmark-model interactions dominate in reasoning tasks. This turns multilingual evaluation into a diagnostic tool that points to concrete causes of disparity.

Core claim

Observable language features explain R²_ling = 79% of language-identity variance on understanding tasks and 92% on reasoning tasks, with a model's internal representational similarity to English as the dominant predictor in both cases; model identity accounts for 66.7% of overall variance in understanding while the benchmark-by-model interaction accounts for 46.3% in reasoning.

What carries the argument

two-step Bayesian hierarchical framework that first isolates variance attributable to language identity and then decomposes the full model-by-benchmark-by-language cube

If this is right

Evaluation can move from reporting raw per-language scores to attributing gaps to measurable language properties and model internals.
Improving a model's internal alignment with English representations should narrow gaps more effectively than other interventions for both understanding and reasoning.
Different task types require different mitigation strategies because their dominant variance sources differ.
New languages can be placed on existing performance curves once their script, family, typological distance, and representational similarity are measured.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition could be applied to generation or translation tasks to test whether the same language features remain dominant.
If representational similarity to English is the strongest lever, then targeted continued pretraining on high-similarity languages might lift low-performing ones more efficiently than broad multilingual data scaling.
Benchmark designers could use the variance profiles to create more balanced test suites that reduce the benchmark-by-model interaction term.

Load-bearing premise

The two-step decomposition isolates genuine attributable variance components rather than reflecting artifacts from how the mLLMs were trained or how the performance numbers were collected.

What would settle it

A regression of task performance on the listed language features and representational similarity measures that fails to reduce residual variance by approximately the reported R² amounts, or that shows a different dominant predictor after the same controls.

Figures

Figures reproduced from arXiv: 2605.28163 by Himanshu Beniwal, Manan Uppadhyay, Pranjal Chitale, Prashant Kodali, Reshma Ramaprasad, Sunayana Sitaram.

**Figure 2.** Figure 2: The three nested model specifications fit in this work, in [PITH_FULL_IMAGE:figures/full_fig_p014_2.png] view at source ↗

read the original abstract

Multilingual Large Language Models (mLLMs) leaderboards report per-language accuracy but rarely explain why disparities emerge, leaving systemic biases unattributed and offering practitioners no actionable levers. We first establish that these gaps are systematic rather than artifacts of sampling noise via distribution-free Friedman and Kruskal--Wallis tests, then introduce a two-step Bayesian hierarchical framework that decomposes multilingual performance variance into interpretable components. First, isolating the variance attributable to language identity, we show that observable language features (script, family, typological distance) explain $R^2_{\text{ling}} = 79\%$ of this variance on understanding tasks and $92\%$ on reasoning, with a model's internal representational similarity to English emerging as the dominant predictor across both task buckets. Second, decomposing the full (model$\times$benchmark$\times$language) cube, we find that NLU and reasoning have fundamentally divergent variance profiles: model identity dominates understanding ($66.7\%$ of variance), whereas the benchmark$\times$model interaction dominates reasoning ($46.3\%$). Together these results recast multilingual evaluation from passive performance mapping into an explainable, diagnostic framework with concrete levers for targeting the root drivers of language disparity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's Bayesian variance decomposition for mLLM gaps is a fresh framing but rests on modeling choices whose robustness to training data confounders cannot be checked from the given details.

read the letter

This paper introduces a two-step Bayesian hierarchical framework to decompose performance variance across languages, models, and benchmarks in multilingual LLMs. It first confirms systematic gaps with Friedman and Kruskal-Wallis tests, then attributes most of the language-level variance to features like script, family, and typological distance (R² of 79% on understanding tasks, 92% on reasoning), while also splitting the full data cube to show model identity dominating NLU variance at 66.7% versus benchmark-model interaction at 46.3% for reasoning.

The work does a few things cleanly. The initial non-parametric tests are distribution-free and independent of the later model, which is a sensible way to establish that gaps are not noise. The split between NLU and reasoning variance profiles is a useful diagnostic distinction that goes beyond standard per-language accuracy tables and could point practitioners toward different fixes for different task types.

The soft spots are mainly around the decomposition step itself. The abstract supplies no model equations, prior specifications, data exclusion rules, or validation for the reported R² and percentage attributions. Without those, it is difficult to tell whether the high language-feature attributions reflect genuine drivers or simply echo imbalances already present in the mLLMs' training corpora, such as uneven token counts or script coverage. The stress-test concern about confounding paths is reasonable on the evidence shown.

This is aimed at researchers who evaluate or improve multilingual models and want more than leaderboard numbers. A reader focused on fairness or diagnostic tools might extract some value from the framing even if the exact numbers require further checks. It deserves a serious referee because the core idea of structured variance attribution is worth testing with full methods and robustness checks supplied, though the current version would likely face questions on the Bayesian component.

Referee Report

2 major / 1 minor

Summary. The paper claims that performance disparities across languages in multilingual LLMs are systematic (established via Friedman and Kruskal-Wallis tests) rather than noise. It introduces a two-step Bayesian hierarchical decomposition: first isolating language-identity variance and showing that observable features (script, family, typological distance) plus representational similarity to English explain R²_ling = 79% on understanding tasks and 92% on reasoning; second, decomposing the full model×benchmark×language cube to reveal that model identity accounts for 66.7% of variance in understanding while the benchmark×model interaction accounts for 46.3% in reasoning. The work positions this as a diagnostic framework with actionable levers for disparity.

Significance. If the decomposition is valid and unconfounded, the results would meaningfully advance multilingual evaluation by shifting from leaderboard reporting to attribution of root causes, identifying concrete targets such as representational alignment and benchmark design. The initial distribution-free tests are a clear strength for establishing non-random gaps independent of the later modeling step.

major comments (2)

[§3 (two-step Bayesian hierarchical framework)] Two-step Bayesian hierarchical framework (abstract and §3): the central R²_ling = 79%/92% claims and the subsequent 66.7%/46.3% attributions rest on the hierarchical model isolating variance components attributable to language features versus training-corpus confounders. No model equations, prior specifications, or conditional-independence assumptions are provided, so it is impossible to verify whether paths from token-count imbalances or script coverage in pretraining data are blocked.
[abstract and decomposition results] Decomposition results (abstract): the claim that observable language features explain the stated percentages of language-identity variance requires that the first step conditions on the second-step cube decomposition. Without explicit variance-component equations or sensitivity checks to training-data imbalances, the reported percentages could be artifacts of how the performance data were generated rather than true attributable shares.

minor comments (1)

[abstract] Notation for R²_ling and the variance percentages should be defined with explicit reference to the hierarchical model components to avoid ambiguity between the language-only step and the full cube decomposition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in our Bayesian framework. We address each major comment below and will revise the manuscript to include the requested specifications, equations, and checks.

read point-by-point responses

Referee: [§3 (two-step Bayesian hierarchical framework)] Two-step Bayesian hierarchical framework (abstract and §3): the central R²_ling = 79%/92% claims and the subsequent 66.7%/46.3% attributions rest on the hierarchical model isolating variance components attributable to language features versus training-corpus confounders. No model equations, prior specifications, or conditional-independence assumptions are provided, so it is impossible to verify whether paths from token-count imbalances or script coverage in pretraining data are blocked.

Authors: We agree that the current manuscript lacks the explicit model equations, prior specifications, and conditional-independence assumptions needed to verify isolation from training-corpus confounders. The revised version will add the full mathematical specification of the two-step Bayesian hierarchical model, including all equations, the priors employed, and a discussion of how the model addresses (or blocks) paths from pretraining statistics such as token counts and script coverage. revision: yes
Referee: [abstract and decomposition results] Decomposition results (abstract): the claim that observable language features explain the stated percentages of language-identity variance requires that the first step conditions on the second-step cube decomposition. Without explicit variance-component equations or sensitivity checks to training-data imbalances, the reported percentages could be artifacts of how the performance data were generated rather than true attributable shares.

Authors: The first step operates on language-identity variance extracted from the full cube, while the second step decomposes the complete model×benchmark×language structure; the steps are sequential rather than the first conditioning on the second. We acknowledge the absence of explicit variance-component equations and sensitivity checks. The revision will supply the variance-component equations, clarify the sequential structure, and incorporate sensitivity analyses to training-data imbalances to demonstrate that the reported R² values are robust. revision: yes

Circularity Check

0 steps flagged

No significant circularity in Bayesian variance decomposition

full rationale

The derivation begins with distribution-free Friedman and Kruskal-Wallis tests to establish non-random performance gaps, followed by a two-step Bayesian hierarchical model whose R²_ling figures are the direct fitted outputs measuring how much variance in those gaps is associated with external observables (script, family, typological distance, representational similarity). These observables are defined independently of the performance cube; the model does not redefine them in terms of the target R², nor does any prediction reduce to a fitted parameter by construction. No self-citation, uniqueness theorem, or ansatz is invoked as load-bearing support for the decomposition itself. The reported percentages are therefore ordinary statistical summaries, not circular reductions.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim depends on the validity of the Bayesian hierarchical variance decomposition and the assumption that language features can be treated as explanatory variables in the model.

free parameters (1)

variance components in the hierarchical model
The Bayesian framework fits multiple variance parameters to attribute portions of the performance cube.

axioms (1)

domain assumption Performance gaps are systematic rather than sampling noise, as established by distribution-free Friedman and Kruskal-Wallis tests
The paper invokes these tests before proceeding to the decomposition step.

pith-pipeline@v0.9.1-grok · 5766 in / 1196 out tokens · 37323 ms · 2026-06-29T12:58:19.769539+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

44 extracted references · 35 canonical work pages · 1 internal anchor

[1]

David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Jian Yun Zhuang, Jesujoba Oluwadara Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En-Shiun Annie Lee, Chiamaka Ijeoma Chukwuneke, Happy Buzaaba, Blessing Kudzaishe Sibanda, Godson Koffi Kalipe, Jonathan Mukiibi, Salomon Kabongo Kabenamualu, Foutse Yuehgoh, Mmasibidi Setaka, Lolwe...

work page doi:10.18653/v1/2025.naacl-long.139 2025
[2]

Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David Mortensen, Noah Smith, and Yulia Tsvetkov. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.614 Do all languages cost the same? tokenization in the era of commercial language models . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9904--9923, ...

work page doi:10.18653/v1/2023.emnlp-main.614 2023
[3]

Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Mohamed Ahmed, Kalika Bali, and Sunayana Sitaram. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.258 MEGA : Multilingual evaluation of generative AI . In Proceedings of the 2023 Conference on Empirical Methods in Natural ...

work page doi:10.18653/v1/2023.emnlp-main.258 2023
[4]

Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Mohamed Ahmed, Kalika Bali, and Sunayana Sitaram. 2024. https://doi.org/10.18653/v1/2024.naacl-long.143 MEGAVERSE : Benchmarking large language models across languages, modalities, models and tasks . In Proceedings of the 2024 Confere...

work page doi:10.18653/v1/2024.naacl-long.143 2024
[5]

David Anugraha, Genta Indra Winata, Chenyue Li, Patrick Amadeus Irawan, and En-Shiun Annie Lee. 2025. https://doi.org/10.18653/v1/2025.findings-naacl.106 P roxy LM : Predicting language model performance on multilingual tasks via proxy models . In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1981--2011, Albuquerque, New Mex...

work page doi:10.18653/v1/2025.findings-naacl.106 2025
[6]

Akari Asai, Sneha Kudugunta, Xinyan Yu, Terra Blevins, Hila Gonen, Machel Reid, Yulia Tsvetkov, Sebastian Ruder, and Hannaneh Hajishirzi. 2024. https://doi.org/10.18653/v1/2024.naacl-long.100 BUFFET : Benchmarking large language models for few-shot cross-lingual transfer . In Proceedings of the 2024 Conference of the North American Chapter of the Associat...

work page doi:10.18653/v1/2024.naacl-long.100 2024
[7]

Baayen, D.J

R.H. Baayen, D.J. Davidson, and D.M. Bates. 2008. https://doi.org/10.1016/j.jml.2007.12.005 Mixed-effects modeling with crossed random effects for subjects and items . Journal of Memory and Language, 59(4):390--412. Special Issue: Emerging Data Analysis

work page doi:10.1016/j.jml.2007.12.005 2008
[8]

Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. 2024. https://doi.org/10.18653/v1/2024.acl-long.44 The belebele benchmark: a parallel reading comprehension dataset in 122 language variants . In Proceedings of the 62nd Annual Meeting of t...

work page doi:10.18653/v1/2024.acl-long.44 2024
[9]

Yoav Benjamini and Yosef Hochberg. 1995. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x Controlling the false discovery rate: A practical and powerful approach to multiple testing . Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289--300

work page doi:10.1111/j.2517-6161.1995.tb02031.x 1995
[10]

Damian Blasi, Antonios Anastasopoulos, and Graham Neubig. 2022. https://doi.org/10.18653/v1/2022.acl-long.376 Systematic inequalities in language technology performance across the world ' s languages . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5486--5505, Dublin, Ireland. Asso...

work page doi:10.18653/v1/2022.acl-long.376 2022
[11]

Terra Blevins and Luke Zettlemoyer. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.233 Language contamination helps explains the cross-lingual capabilities of E nglish pretrained models . In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3563--3574, Abu Dhabi, United Arab Emirates. Association for Computat...

work page doi:10.18653/v1/2022.emnlp-main.233 2022
[12]

Monojit Choudhury and Amit Deshpande. 2021. How linguistically fair are multilingual pre-trained language models? In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 12710--12718

2021
[13]

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm \'a n, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. https://doi.org/10.18653/v1/2020.acl-main.747 Unsupervised cross-lingual representation learning at scale . In Proceedings of the 58th Annual Meeting of the Association for Comp...

work page doi:10.18653/v1/2020.acl-main.747 2020
[14]

Bowman, Holger Schwenk, and Veselin Stoyanov

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. https://doi.org/10.18653/v1/D18-1269 XNLI : Evaluating cross-lingual sentence representations . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475--2485, Brussels, Belgium. Associat...

work page doi:10.18653/v1/d18-1269 2018
[15]

Marta R. Costa-juss \`a , James Cross, Onur C elebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, and 20 others. 2024. https://doi.org/10.1038/s41586-024-0...

work page doi:10.1038/s41586-024-07335-x 2024
[16]

Marta R Costa-Juss \`a , James Cross, Onur C elebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, and 1 others. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

Cl \'e ment Dumas, Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West. 2025. https://doi.org/10.18653/v1/2025.acl-long.1536 Separating tongue from thought: Activation patching reveals language-agnostic concept representations in transformers . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volu...

work page doi:10.18653/v1/2025.acl-long.1536 2025
[18]

Olive Jean Dunn. 1964. http://www.jstor.org/stable/1266041 Multiple comparisons using rank sums . Technometrics, 6(3):241--252

work page arXiv 1964
[19]

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, and 5 others. 2024. https://doi.org/10.5281/zenodo.12608602 The languag...

work page doi:10.5281/zenodo.12608602 2024
[20]

Songbo Hu, Ivan Vuli \'c , and Anna Korhonen. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.199 Quantifying language disparities in multilingual large language models . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4003--4018, Suzhou, China. Association for Computational Linguistics

work page doi:10.18653/v1/2025.emnlp-main.199 2025
[21]

Songbo Hu, Han Zhou, Moy Yuan, Milan Gritta, Guchun Zhang, Ignacio Iacobacci, Anna Korhonen, and Ivan Vuli \'c . 2023. https://doi.org/10.18653/v1/2023.emnlp-main.422 A systematic study of performance disparities in multilingual task-oriented dialogue systems . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, page...

work page doi:10.18653/v1/2023.emnlp-main.422 2023
[22]

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. https://doi.org/10.18653/v1/2020.acl-main.560 The state and fate of linguistic diversity and inclusion in the NLP world . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282--6293, Online. Association for Computational...

work page doi:10.18653/v1/2020.acl-main.560 2020
[23]

Maurice G Kendall and B Babington Smith. 1939. The problem of m rankings. The annals of mathematical statistics, 10(3):275--287

1939
[24]

Simran Khanuja, Sebastian Ruder, and Partha Talukdar. 2023. https://doi.org/10.18653/v1/2023.findings-eacl.131 Evaluating the diversity, equity, and inclusion of NLP technology: A case study for I ndian languages . In Findings of the Association for Computational Linguistics: EACL 2023, pages 1763--1777, Dubrovnik, Croatia. Association for Computational L...

work page doi:10.18653/v1/2023.findings-eacl.131 2023
[25]

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. https://proceedings.mlr.press/v97/kornblith19a.html Similarity of neural network representations revisited . In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3519--3529. PMLR

2019
[26]

Fajri Koto, Nurul Aisyah, Haonan Li, and Timothy Baldwin. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.760 Large language models only pass primary school exams in I ndonesia: A comprehensive test on I ndo MMLU . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12359--12374, Singapore. Association for Co...

work page doi:10.18653/v1/2023.emnlp-main.760 2023
[27]

Anne Lauscher, Vinit Ravishankar, Ivan Vuli \'c , and Goran Glava s . 2020. https://doi.org/10.18653/v1/2020.emnlp-main.363 From zero to hero: O n the limitations of zero-shot language transfer with multilingual T ransformers . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4483--4499, Online. Asso...

work page doi:10.18653/v1/2020.emnlp-main.363 2020
[28]

Zihao Li, Yucheng Shi, Zirui Liu, Fan Yang, Ali Payani, Ninghao Liu, and Mengnan Du. 2025. https://doi.org/10.1609/aaai.v39i27.35038 Language ranker: A metric for quantifying llm performance across high and low-resource languages . Proceedings of the AAAI Conference on Artificial Intelligence, 39(27):28186–28194

work page doi:10.1609/aaai.v39i27.35038 2025
[29]

Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin

Patrick Littell, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. https://aclanthology.org/E17-2002/ URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors . In Proceedings of the 15th Conference of the E uropean Chapter of the Association for Computational Linguistics: Volume ...

2017
[30]

Jessica Ojo, Odunayo Ogundepo, Akintunde Oladipo, Kelechi Ogueji, Jimmy Lin, Pontus Stenetorp, and David Ifeoluwa Adelani. 2025. https://doi.org/10.18653/v1/2025.findings-acl.976 A fro B ench: How good are large language models on A frican languages? In Findings of the Association for Computational Linguistics: ACL 2025, pages 19048--19095, Vienna, Austri...

work page doi:10.18653/v1/2025.findings-acl.976 2025
[31]

Aleksandar Petrov, Emanuele La Malfa, Philip Torr, and Adel Bibi. 2023. https://openreview.net/forum?id=78yDLKi95p Language model tokenizers introduce unfairness between languages . In Thirty-seventh Conference on Neural Information Processing Systems

2023
[32]

Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability . In Advances in Neural Information Processing Systems, volume 30. Curran ...

2017
[33]

Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, and 38 others

Angelika Romanou, Negar Foroutan, Anna Sotnikova, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Zeming Chen, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, and 38 others. 2025. https://openreview.net/f...

2025
[34]

Phillip Rust, Jonas Pfeiffer, Ivan Vuli \'c , Sebastian Ruder, and Iryna Gurevych. 2021. https://doi.org/10.18653/v1/2021.acl-long.243 How good is your tokenizer? on the monolingual performance of multilingual language models . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confe...

work page doi:10.18653/v1/2021.acl-long.243 2021
[35]

Amartya Sen. 1976. https://doi.org/10.2307/2296597 Real national income . The Review of Economic Studies, 43(1):19--39

work page doi:10.2307/2296597 1976
[36]

Shivalika Singh, Angelika Romanou, Cl \'e mentine Fourrier, David Ifeoluwa Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Sebastian Ruder, Wei-Yin Ko, Antoine Bosselut, Alice Oh, Andre Martins, Leshem Choshen, Daphne Ippolito, and 4 others. 2025. https://doi....

work page doi:10.18653/v1/2025.acl-long.919 2025
[37]

Yosephine Susanto, Adithya Venkatadri Hulagadri, Jann Railey Montalan, Jian Gang Ngui, Xianbin Yong, Wei Qi Leong, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Yifan Mai, and William Chandra Tjhi. 2025. https://doi.org/10.18653/v1/2025.findings-acl.636 SEA - HELM : S outheast A sian holistic evaluation of language models . In Findings of the Associati...

work page doi:10.18653/v1/2025.findings-acl.636 2025
[38]

Dennis Ulmer, Elisa Bassignana, Max M \"u ller-Eberstein, Daniel Varab, Mike Zhang, Rob van der Goot, Christian Hardmeier, and Barbara Plank. 2022. https://doi.org/10.18653/v1/2022.findings-emnlp.196 Experimental standards for deep learning in natural language processing research . In Findings of the Association for Computational Linguistics: EMNLP 2022, ...

work page doi:10.18653/v1/2022.findings-emnlp.196 2022
[39]

Sshubam Verma, Mohammed Safi Ur Rahman Khan, Vishwajeet Kumar, Rudra Murthy, and Jaydeep Sen. 2025. https://doi.org/10.18653/v1/2025.naacl-long.507 MILU : A multi-task I ndic language understanding benchmark . In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Techn...

work page doi:10.18653/v1/2025.naacl-long.507 2025
[40]

Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West. 2024. https://doi.org/10.18653/v1/2024.acl-long.820 Do llamas work in E nglish? on the latent language of multilingual transformers . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15366--15394, Bangkok, Thailand....

work page doi:10.18653/v1/2024.acl-long.820 2024
[41]

Shijie Wu and Mark Dredze. 2020. https://doi.org/10.18653/v1/2020.repl4nlp-1.16 Are all languages created equal in multilingual BERT ? In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 120--130, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.repl4nlp-1.16 2020
[42]

Weihao Xuan, Rui Yang, Heli Qi, Qingcheng Zeng, Yunze Xiao, Aosong Feng, Dairui Liu, Yun Xing, Junjue Wang, Fan Gao, Jinghui Lu, Yuang Jiang, Huitao Li, Xin Li, Kunyu Yu, Ruihai Dong, Shangding Gu, Yuekang Li, Xiaofei Xie, and 13 others. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.79 MMLU - P ro X : A multilingual benchmark for advanced large langua...

work page doi:10.18653/v1/2025.emnlp-main.79 2025
[43]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...
[44]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

[1] [1]

David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Jian Yun Zhuang, Jesujoba Oluwadara Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En-Shiun Annie Lee, Chiamaka Ijeoma Chukwuneke, Happy Buzaaba, Blessing Kudzaishe Sibanda, Godson Koffi Kalipe, Jonathan Mukiibi, Salomon Kabongo Kabenamualu, Foutse Yuehgoh, Mmasibidi Setaka, Lolwe...

work page doi:10.18653/v1/2025.naacl-long.139 2025

[2] [2]

Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David Mortensen, Noah Smith, and Yulia Tsvetkov. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.614 Do all languages cost the same? tokenization in the era of commercial language models . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9904--9923, ...

work page doi:10.18653/v1/2023.emnlp-main.614 2023

[3] [3]

Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Mohamed Ahmed, Kalika Bali, and Sunayana Sitaram. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.258 MEGA : Multilingual evaluation of generative AI . In Proceedings of the 2023 Conference on Empirical Methods in Natural ...

work page doi:10.18653/v1/2023.emnlp-main.258 2023

[4] [4]

Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Mohamed Ahmed, Kalika Bali, and Sunayana Sitaram. 2024. https://doi.org/10.18653/v1/2024.naacl-long.143 MEGAVERSE : Benchmarking large language models across languages, modalities, models and tasks . In Proceedings of the 2024 Confere...

work page doi:10.18653/v1/2024.naacl-long.143 2024

[5] [5]

David Anugraha, Genta Indra Winata, Chenyue Li, Patrick Amadeus Irawan, and En-Shiun Annie Lee. 2025. https://doi.org/10.18653/v1/2025.findings-naacl.106 P roxy LM : Predicting language model performance on multilingual tasks via proxy models . In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1981--2011, Albuquerque, New Mex...

work page doi:10.18653/v1/2025.findings-naacl.106 2025

[6] [6]

Akari Asai, Sneha Kudugunta, Xinyan Yu, Terra Blevins, Hila Gonen, Machel Reid, Yulia Tsvetkov, Sebastian Ruder, and Hannaneh Hajishirzi. 2024. https://doi.org/10.18653/v1/2024.naacl-long.100 BUFFET : Benchmarking large language models for few-shot cross-lingual transfer . In Proceedings of the 2024 Conference of the North American Chapter of the Associat...

work page doi:10.18653/v1/2024.naacl-long.100 2024

[7] [7]

Baayen, D.J

R.H. Baayen, D.J. Davidson, and D.M. Bates. 2008. https://doi.org/10.1016/j.jml.2007.12.005 Mixed-effects modeling with crossed random effects for subjects and items . Journal of Memory and Language, 59(4):390--412. Special Issue: Emerging Data Analysis

work page doi:10.1016/j.jml.2007.12.005 2008

[8] [8]

Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. 2024. https://doi.org/10.18653/v1/2024.acl-long.44 The belebele benchmark: a parallel reading comprehension dataset in 122 language variants . In Proceedings of the 62nd Annual Meeting of t...

work page doi:10.18653/v1/2024.acl-long.44 2024

[9] [9]

Yoav Benjamini and Yosef Hochberg. 1995. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x Controlling the false discovery rate: A practical and powerful approach to multiple testing . Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289--300

work page doi:10.1111/j.2517-6161.1995.tb02031.x 1995

[10] [10]

Damian Blasi, Antonios Anastasopoulos, and Graham Neubig. 2022. https://doi.org/10.18653/v1/2022.acl-long.376 Systematic inequalities in language technology performance across the world ' s languages . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5486--5505, Dublin, Ireland. Asso...

work page doi:10.18653/v1/2022.acl-long.376 2022

[11] [11]

Terra Blevins and Luke Zettlemoyer. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.233 Language contamination helps explains the cross-lingual capabilities of E nglish pretrained models . In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3563--3574, Abu Dhabi, United Arab Emirates. Association for Computat...

work page doi:10.18653/v1/2022.emnlp-main.233 2022

[12] [12]

Monojit Choudhury and Amit Deshpande. 2021. How linguistically fair are multilingual pre-trained language models? In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 12710--12718

2021

[13] [13]

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm \'a n, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. https://doi.org/10.18653/v1/2020.acl-main.747 Unsupervised cross-lingual representation learning at scale . In Proceedings of the 58th Annual Meeting of the Association for Comp...

work page doi:10.18653/v1/2020.acl-main.747 2020

[14] [14]

Bowman, Holger Schwenk, and Veselin Stoyanov

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. https://doi.org/10.18653/v1/D18-1269 XNLI : Evaluating cross-lingual sentence representations . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475--2485, Brussels, Belgium. Associat...

work page doi:10.18653/v1/d18-1269 2018

[15] [15]

Marta R. Costa-juss \`a , James Cross, Onur C elebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, and 20 others. 2024. https://doi.org/10.1038/s41586-024-0...

work page doi:10.1038/s41586-024-07335-x 2024

[16] [16]

Marta R Costa-Juss \`a , James Cross, Onur C elebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, and 1 others. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672

work page internal anchor Pith review Pith/arXiv arXiv 2022

[17] [17]

Cl \'e ment Dumas, Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West. 2025. https://doi.org/10.18653/v1/2025.acl-long.1536 Separating tongue from thought: Activation patching reveals language-agnostic concept representations in transformers . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volu...

work page doi:10.18653/v1/2025.acl-long.1536 2025

[18] [18]

Olive Jean Dunn. 1964. http://www.jstor.org/stable/1266041 Multiple comparisons using rank sums . Technometrics, 6(3):241--252

work page arXiv 1964

[19] [19]

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, and 5 others. 2024. https://doi.org/10.5281/zenodo.12608602 The languag...

work page doi:10.5281/zenodo.12608602 2024

[20] [20]

Songbo Hu, Ivan Vuli \'c , and Anna Korhonen. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.199 Quantifying language disparities in multilingual large language models . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4003--4018, Suzhou, China. Association for Computational Linguistics

work page doi:10.18653/v1/2025.emnlp-main.199 2025

[21] [21]

Songbo Hu, Han Zhou, Moy Yuan, Milan Gritta, Guchun Zhang, Ignacio Iacobacci, Anna Korhonen, and Ivan Vuli \'c . 2023. https://doi.org/10.18653/v1/2023.emnlp-main.422 A systematic study of performance disparities in multilingual task-oriented dialogue systems . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, page...

work page doi:10.18653/v1/2023.emnlp-main.422 2023

[22] [22]

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. https://doi.org/10.18653/v1/2020.acl-main.560 The state and fate of linguistic diversity and inclusion in the NLP world . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282--6293, Online. Association for Computational...

work page doi:10.18653/v1/2020.acl-main.560 2020

[23] [23]

Maurice G Kendall and B Babington Smith. 1939. The problem of m rankings. The annals of mathematical statistics, 10(3):275--287

1939

[24] [24]

Simran Khanuja, Sebastian Ruder, and Partha Talukdar. 2023. https://doi.org/10.18653/v1/2023.findings-eacl.131 Evaluating the diversity, equity, and inclusion of NLP technology: A case study for I ndian languages . In Findings of the Association for Computational Linguistics: EACL 2023, pages 1763--1777, Dubrovnik, Croatia. Association for Computational L...

work page doi:10.18653/v1/2023.findings-eacl.131 2023

[25] [25]

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. https://proceedings.mlr.press/v97/kornblith19a.html Similarity of neural network representations revisited . In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3519--3529. PMLR

2019

[26] [26]

Fajri Koto, Nurul Aisyah, Haonan Li, and Timothy Baldwin. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.760 Large language models only pass primary school exams in I ndonesia: A comprehensive test on I ndo MMLU . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12359--12374, Singapore. Association for Co...

work page doi:10.18653/v1/2023.emnlp-main.760 2023

[27] [27]

Anne Lauscher, Vinit Ravishankar, Ivan Vuli \'c , and Goran Glava s . 2020. https://doi.org/10.18653/v1/2020.emnlp-main.363 From zero to hero: O n the limitations of zero-shot language transfer with multilingual T ransformers . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4483--4499, Online. Asso...

work page doi:10.18653/v1/2020.emnlp-main.363 2020

[28] [28]

Zihao Li, Yucheng Shi, Zirui Liu, Fan Yang, Ali Payani, Ninghao Liu, and Mengnan Du. 2025. https://doi.org/10.1609/aaai.v39i27.35038 Language ranker: A metric for quantifying llm performance across high and low-resource languages . Proceedings of the AAAI Conference on Artificial Intelligence, 39(27):28186–28194

work page doi:10.1609/aaai.v39i27.35038 2025

[29] [29]

Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin

Patrick Littell, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. https://aclanthology.org/E17-2002/ URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors . In Proceedings of the 15th Conference of the E uropean Chapter of the Association for Computational Linguistics: Volume ...

2017

[30] [30]

Jessica Ojo, Odunayo Ogundepo, Akintunde Oladipo, Kelechi Ogueji, Jimmy Lin, Pontus Stenetorp, and David Ifeoluwa Adelani. 2025. https://doi.org/10.18653/v1/2025.findings-acl.976 A fro B ench: How good are large language models on A frican languages? In Findings of the Association for Computational Linguistics: ACL 2025, pages 19048--19095, Vienna, Austri...

work page doi:10.18653/v1/2025.findings-acl.976 2025

[31] [31]

Aleksandar Petrov, Emanuele La Malfa, Philip Torr, and Adel Bibi. 2023. https://openreview.net/forum?id=78yDLKi95p Language model tokenizers introduce unfairness between languages . In Thirty-seventh Conference on Neural Information Processing Systems

2023

[32] [32]

Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability . In Advances in Neural Information Processing Systems, volume 30. Curran ...

2017

[33] [33]

Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, and 38 others

Angelika Romanou, Negar Foroutan, Anna Sotnikova, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Zeming Chen, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, and 38 others. 2025. https://openreview.net/f...

2025

[34] [34]

Phillip Rust, Jonas Pfeiffer, Ivan Vuli \'c , Sebastian Ruder, and Iryna Gurevych. 2021. https://doi.org/10.18653/v1/2021.acl-long.243 How good is your tokenizer? on the monolingual performance of multilingual language models . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confe...

work page doi:10.18653/v1/2021.acl-long.243 2021

[35] [35]

Amartya Sen. 1976. https://doi.org/10.2307/2296597 Real national income . The Review of Economic Studies, 43(1):19--39

work page doi:10.2307/2296597 1976

[36] [36]

Shivalika Singh, Angelika Romanou, Cl \'e mentine Fourrier, David Ifeoluwa Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Sebastian Ruder, Wei-Yin Ko, Antoine Bosselut, Alice Oh, Andre Martins, Leshem Choshen, Daphne Ippolito, and 4 others. 2025. https://doi....

work page doi:10.18653/v1/2025.acl-long.919 2025

[37] [37]

Yosephine Susanto, Adithya Venkatadri Hulagadri, Jann Railey Montalan, Jian Gang Ngui, Xianbin Yong, Wei Qi Leong, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Yifan Mai, and William Chandra Tjhi. 2025. https://doi.org/10.18653/v1/2025.findings-acl.636 SEA - HELM : S outheast A sian holistic evaluation of language models . In Findings of the Associati...

work page doi:10.18653/v1/2025.findings-acl.636 2025

[38] [38]

Dennis Ulmer, Elisa Bassignana, Max M \"u ller-Eberstein, Daniel Varab, Mike Zhang, Rob van der Goot, Christian Hardmeier, and Barbara Plank. 2022. https://doi.org/10.18653/v1/2022.findings-emnlp.196 Experimental standards for deep learning in natural language processing research . In Findings of the Association for Computational Linguistics: EMNLP 2022, ...

work page doi:10.18653/v1/2022.findings-emnlp.196 2022

[39] [39]

Sshubam Verma, Mohammed Safi Ur Rahman Khan, Vishwajeet Kumar, Rudra Murthy, and Jaydeep Sen. 2025. https://doi.org/10.18653/v1/2025.naacl-long.507 MILU : A multi-task I ndic language understanding benchmark . In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Techn...

work page doi:10.18653/v1/2025.naacl-long.507 2025

[40] [40]

Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West. 2024. https://doi.org/10.18653/v1/2024.acl-long.820 Do llamas work in E nglish? on the latent language of multilingual transformers . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15366--15394, Bangkok, Thailand....

work page doi:10.18653/v1/2024.acl-long.820 2024

[41] [41]

Shijie Wu and Mark Dredze. 2020. https://doi.org/10.18653/v1/2020.repl4nlp-1.16 Are all languages created equal in multilingual BERT ? In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 120--130, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.repl4nlp-1.16 2020

[42] [42]

Weihao Xuan, Rui Yang, Heli Qi, Qingcheng Zeng, Yunze Xiao, Aosong Feng, Dairui Liu, Yun Xing, Junjue Wang, Fan Gao, Jinghui Lu, Yuang Jiang, Huitao Li, Xin Li, Kunyu Yu, Ruihai Dong, Shangding Gu, Yuekang Li, Xiaofei Xie, and 13 others. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.79 MMLU - P ro X : A multilingual benchmark for advanced large langua...

work page doi:10.18653/v1/2025.emnlp-main.79 2025

[43] [43]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

[44] [44]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...