pith. sign in

arxiv: 2605.28163 · v1 · pith:RZUDR4WDnew · submitted 2026-05-27 · 💻 cs.CL · cs.AI

DEPART: DEcomposing PARiTy across Multilingual LLMs

Pith reviewed 2026-06-29 12:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multilingual LLMsperformance variance decompositionBayesian hierarchical modelinglanguage featuresrepresentational similarityNLU vs reasoning
0
0 comments X

The pith

Observable language features explain 79% of variance on understanding tasks and 92% on reasoning in multilingual LLMs, with internal similarity to English as the strongest predictor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first confirms that performance gaps across languages in mLLMs are systematic rather than random using distribution-free statistical tests. It then applies a two-step Bayesian hierarchical decomposition to separate variance due to language identity from other factors. Observable traits such as script, language family, and typological distance account for most of the language-level variance, while a model's representational closeness to English stands out as the main driver. The framework also shows that model identity explains the bulk of variance in understanding tasks, but benchmark-model interactions dominate in reasoning tasks. This turns multilingual evaluation into a diagnostic tool that points to concrete causes of disparity.

Core claim

Observable language features explain R²_ling = 79% of language-identity variance on understanding tasks and 92% on reasoning tasks, with a model's internal representational similarity to English as the dominant predictor in both cases; model identity accounts for 66.7% of overall variance in understanding while the benchmark-by-model interaction accounts for 46.3% in reasoning.

What carries the argument

two-step Bayesian hierarchical framework that first isolates variance attributable to language identity and then decomposes the full model-by-benchmark-by-language cube

If this is right

  • Evaluation can move from reporting raw per-language scores to attributing gaps to measurable language properties and model internals.
  • Improving a model's internal alignment with English representations should narrow gaps more effectively than other interventions for both understanding and reasoning.
  • Different task types require different mitigation strategies because their dominant variance sources differ.
  • New languages can be placed on existing performance curves once their script, family, typological distance, and representational similarity are measured.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition could be applied to generation or translation tasks to test whether the same language features remain dominant.
  • If representational similarity to English is the strongest lever, then targeted continued pretraining on high-similarity languages might lift low-performing ones more efficiently than broad multilingual data scaling.
  • Benchmark designers could use the variance profiles to create more balanced test suites that reduce the benchmark-by-model interaction term.

Load-bearing premise

The two-step decomposition isolates genuine attributable variance components rather than reflecting artifacts from how the mLLMs were trained or how the performance numbers were collected.

What would settle it

A regression of task performance on the listed language features and representational similarity measures that fails to reduce residual variance by approximately the reported R² amounts, or that shows a different dominant predictor after the same controls.

Figures

Figures reproduced from arXiv: 2605.28163 by Himanshu Beniwal, Manan Uppadhyay, Pranjal Chitale, Prashant Kodali, Reshma Ramaprasad, Sunayana Sitaram.

Figure 1
Figure 1. Figure 1: Structural factors dominate cross-lingual performance gaps in multilingual LLMs. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The three nested model specifications fit in this work, in [PITH_FULL_IMAGE:figures/full_fig_p014_2.png] view at source ↗
read the original abstract

Multilingual Large Language Models (mLLMs) leaderboards report per-language accuracy but rarely explain why disparities emerge, leaving systemic biases unattributed and offering practitioners no actionable levers. We first establish that these gaps are systematic rather than artifacts of sampling noise via distribution-free Friedman and Kruskal--Wallis tests, then introduce a two-step Bayesian hierarchical framework that decomposes multilingual performance variance into interpretable components. First, isolating the variance attributable to language identity, we show that observable language features (script, family, typological distance) explain $R^2_{\text{ling}} = 79\%$ of this variance on understanding tasks and $92\%$ on reasoning, with a model's internal representational similarity to English emerging as the dominant predictor across both task buckets. Second, decomposing the full (model$\times$benchmark$\times$language) cube, we find that NLU and reasoning have fundamentally divergent variance profiles: model identity dominates understanding ($66.7\%$ of variance), whereas the benchmark$\times$model interaction dominates reasoning ($46.3\%$). Together these results recast multilingual evaluation from passive performance mapping into an explainable, diagnostic framework with concrete levers for targeting the root drivers of language disparity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that performance disparities across languages in multilingual LLMs are systematic (established via Friedman and Kruskal-Wallis tests) rather than noise. It introduces a two-step Bayesian hierarchical decomposition: first isolating language-identity variance and showing that observable features (script, family, typological distance) plus representational similarity to English explain R²_ling = 79% on understanding tasks and 92% on reasoning; second, decomposing the full model×benchmark×language cube to reveal that model identity accounts for 66.7% of variance in understanding while the benchmark×model interaction accounts for 46.3% in reasoning. The work positions this as a diagnostic framework with actionable levers for disparity.

Significance. If the decomposition is valid and unconfounded, the results would meaningfully advance multilingual evaluation by shifting from leaderboard reporting to attribution of root causes, identifying concrete targets such as representational alignment and benchmark design. The initial distribution-free tests are a clear strength for establishing non-random gaps independent of the later modeling step.

major comments (2)
  1. [§3 (two-step Bayesian hierarchical framework)] Two-step Bayesian hierarchical framework (abstract and §3): the central R²_ling = 79%/92% claims and the subsequent 66.7%/46.3% attributions rest on the hierarchical model isolating variance components attributable to language features versus training-corpus confounders. No model equations, prior specifications, or conditional-independence assumptions are provided, so it is impossible to verify whether paths from token-count imbalances or script coverage in pretraining data are blocked.
  2. [abstract and decomposition results] Decomposition results (abstract): the claim that observable language features explain the stated percentages of language-identity variance requires that the first step conditions on the second-step cube decomposition. Without explicit variance-component equations or sensitivity checks to training-data imbalances, the reported percentages could be artifacts of how the performance data were generated rather than true attributable shares.
minor comments (1)
  1. [abstract] Notation for R²_ling and the variance percentages should be defined with explicit reference to the hierarchical model components to avoid ambiguity between the language-only step and the full cube decomposition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in our Bayesian framework. We address each major comment below and will revise the manuscript to include the requested specifications, equations, and checks.

read point-by-point responses
  1. Referee: [§3 (two-step Bayesian hierarchical framework)] Two-step Bayesian hierarchical framework (abstract and §3): the central R²_ling = 79%/92% claims and the subsequent 66.7%/46.3% attributions rest on the hierarchical model isolating variance components attributable to language features versus training-corpus confounders. No model equations, prior specifications, or conditional-independence assumptions are provided, so it is impossible to verify whether paths from token-count imbalances or script coverage in pretraining data are blocked.

    Authors: We agree that the current manuscript lacks the explicit model equations, prior specifications, and conditional-independence assumptions needed to verify isolation from training-corpus confounders. The revised version will add the full mathematical specification of the two-step Bayesian hierarchical model, including all equations, the priors employed, and a discussion of how the model addresses (or blocks) paths from pretraining statistics such as token counts and script coverage. revision: yes

  2. Referee: [abstract and decomposition results] Decomposition results (abstract): the claim that observable language features explain the stated percentages of language-identity variance requires that the first step conditions on the second-step cube decomposition. Without explicit variance-component equations or sensitivity checks to training-data imbalances, the reported percentages could be artifacts of how the performance data were generated rather than true attributable shares.

    Authors: The first step operates on language-identity variance extracted from the full cube, while the second step decomposes the complete model×benchmark×language structure; the steps are sequential rather than the first conditioning on the second. We acknowledge the absence of explicit variance-component equations and sensitivity checks. The revision will supply the variance-component equations, clarify the sequential structure, and incorporate sensitivity analyses to training-data imbalances to demonstrate that the reported R² values are robust. revision: yes

Circularity Check

0 steps flagged

No significant circularity in Bayesian variance decomposition

full rationale

The derivation begins with distribution-free Friedman and Kruskal-Wallis tests to establish non-random performance gaps, followed by a two-step Bayesian hierarchical model whose R²_ling figures are the direct fitted outputs measuring how much variance in those gaps is associated with external observables (script, family, typological distance, representational similarity). These observables are defined independently of the performance cube; the model does not redefine them in terms of the target R², nor does any prediction reduce to a fitted parameter by construction. No self-citation, uniqueness theorem, or ansatz is invoked as load-bearing support for the decomposition itself. The reported percentages are therefore ordinary statistical summaries, not circular reductions.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim depends on the validity of the Bayesian hierarchical variance decomposition and the assumption that language features can be treated as explanatory variables in the model.

free parameters (1)
  • variance components in the hierarchical model
    The Bayesian framework fits multiple variance parameters to attribute portions of the performance cube.
axioms (1)
  • domain assumption Performance gaps are systematic rather than sampling noise, as established by distribution-free Friedman and Kruskal-Wallis tests
    The paper invokes these tests before proceeding to the decomposition step.

pith-pipeline@v0.9.1-grok · 5766 in / 1196 out tokens · 37323 ms · 2026-06-29T12:58:19.769539+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 35 canonical work pages · 1 internal anchor

  1. [1]

    David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Jian Yun Zhuang, Jesujoba Oluwadara Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En-Shiun Annie Lee, Chiamaka Ijeoma Chukwuneke, Happy Buzaaba, Blessing Kudzaishe Sibanda, Godson Koffi Kalipe, Jonathan Mukiibi, Salomon Kabongo Kabenamualu, Foutse Yuehgoh, Mmasibidi Setaka, Lolwe...

  2. [2]

    Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David Mortensen, Noah Smith, and Yulia Tsvetkov. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.614 Do all languages cost the same? tokenization in the era of commercial language models . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9904--9923, ...

  3. [3]

    Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Mohamed Ahmed, Kalika Bali, and Sunayana Sitaram. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.258 MEGA : Multilingual evaluation of generative AI . In Proceedings of the 2023 Conference on Empirical Methods in Natural ...

  4. [4]

    Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Mohamed Ahmed, Kalika Bali, and Sunayana Sitaram. 2024. https://doi.org/10.18653/v1/2024.naacl-long.143 MEGAVERSE : Benchmarking large language models across languages, modalities, models and tasks . In Proceedings of the 2024 Confere...

  5. [5]

    David Anugraha, Genta Indra Winata, Chenyue Li, Patrick Amadeus Irawan, and En-Shiun Annie Lee. 2025. https://doi.org/10.18653/v1/2025.findings-naacl.106 P roxy LM : Predicting language model performance on multilingual tasks via proxy models . In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1981--2011, Albuquerque, New Mex...

  6. [6]

    Akari Asai, Sneha Kudugunta, Xinyan Yu, Terra Blevins, Hila Gonen, Machel Reid, Yulia Tsvetkov, Sebastian Ruder, and Hannaneh Hajishirzi. 2024. https://doi.org/10.18653/v1/2024.naacl-long.100 BUFFET : Benchmarking large language models for few-shot cross-lingual transfer . In Proceedings of the 2024 Conference of the North American Chapter of the Associat...

  7. [7]

    Baayen, D.J

    R.H. Baayen, D.J. Davidson, and D.M. Bates. 2008. https://doi.org/10.1016/j.jml.2007.12.005 Mixed-effects modeling with crossed random effects for subjects and items . Journal of Memory and Language, 59(4):390--412. Special Issue: Emerging Data Analysis

  8. [8]

    Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. 2024. https://doi.org/10.18653/v1/2024.acl-long.44 The belebele benchmark: a parallel reading comprehension dataset in 122 language variants . In Proceedings of the 62nd Annual Meeting of t...

  9. [9]

    Yoav Benjamini and Yosef Hochberg. 1995. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x Controlling the false discovery rate: A practical and powerful approach to multiple testing . Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289--300

  10. [10]

    Damian Blasi, Antonios Anastasopoulos, and Graham Neubig. 2022. https://doi.org/10.18653/v1/2022.acl-long.376 Systematic inequalities in language technology performance across the world ' s languages . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5486--5505, Dublin, Ireland. Asso...

  11. [11]

    Terra Blevins and Luke Zettlemoyer. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.233 Language contamination helps explains the cross-lingual capabilities of E nglish pretrained models . In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3563--3574, Abu Dhabi, United Arab Emirates. Association for Computat...

  12. [12]

    Monojit Choudhury and Amit Deshpande. 2021. How linguistically fair are multilingual pre-trained language models? In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 12710--12718

  13. [13]

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm \'a n, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. https://doi.org/10.18653/v1/2020.acl-main.747 Unsupervised cross-lingual representation learning at scale . In Proceedings of the 58th Annual Meeting of the Association for Comp...

  14. [14]

    Bowman, Holger Schwenk, and Veselin Stoyanov

    Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. https://doi.org/10.18653/v1/D18-1269 XNLI : Evaluating cross-lingual sentence representations . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475--2485, Brussels, Belgium. Associat...

  15. [15]

    Marta R. Costa-juss \`a , James Cross, Onur C elebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, and 20 others. 2024. https://doi.org/10.1038/s41586-024-0...

  16. [16]

    Marta R Costa-Juss \`a , James Cross, Onur C elebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, and 1 others. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672

  17. [17]

    Cl \'e ment Dumas, Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West. 2025. https://doi.org/10.18653/v1/2025.acl-long.1536 Separating tongue from thought: Activation patching reveals language-agnostic concept representations in transformers . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volu...

  18. [18]

    Olive Jean Dunn. 1964. http://www.jstor.org/stable/1266041 Multiple comparisons using rank sums . Technometrics, 6(3):241--252

  19. [19]

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, and 5 others. 2024. https://doi.org/10.5281/zenodo.12608602 The languag...

  20. [20]

    Songbo Hu, Ivan Vuli \'c , and Anna Korhonen. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.199 Quantifying language disparities in multilingual large language models . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4003--4018, Suzhou, China. Association for Computational Linguistics

  21. [21]

    Songbo Hu, Han Zhou, Moy Yuan, Milan Gritta, Guchun Zhang, Ignacio Iacobacci, Anna Korhonen, and Ivan Vuli \'c . 2023. https://doi.org/10.18653/v1/2023.emnlp-main.422 A systematic study of performance disparities in multilingual task-oriented dialogue systems . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, page...

  22. [22]

    Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. https://doi.org/10.18653/v1/2020.acl-main.560 The state and fate of linguistic diversity and inclusion in the NLP world . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282--6293, Online. Association for Computational...

  23. [23]

    Maurice G Kendall and B Babington Smith. 1939. The problem of m rankings. The annals of mathematical statistics, 10(3):275--287

  24. [24]

    Simran Khanuja, Sebastian Ruder, and Partha Talukdar. 2023. https://doi.org/10.18653/v1/2023.findings-eacl.131 Evaluating the diversity, equity, and inclusion of NLP technology: A case study for I ndian languages . In Findings of the Association for Computational Linguistics: EACL 2023, pages 1763--1777, Dubrovnik, Croatia. Association for Computational L...

  25. [25]

    Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. https://proceedings.mlr.press/v97/kornblith19a.html Similarity of neural network representations revisited . In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3519--3529. PMLR

  26. [26]

    Fajri Koto, Nurul Aisyah, Haonan Li, and Timothy Baldwin. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.760 Large language models only pass primary school exams in I ndonesia: A comprehensive test on I ndo MMLU . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12359--12374, Singapore. Association for Co...

  27. [27]

    Anne Lauscher, Vinit Ravishankar, Ivan Vuli \'c , and Goran Glava s . 2020. https://doi.org/10.18653/v1/2020.emnlp-main.363 From zero to hero: O n the limitations of zero-shot language transfer with multilingual T ransformers . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4483--4499, Online. Asso...

  28. [28]

    Zihao Li, Yucheng Shi, Zirui Liu, Fan Yang, Ali Payani, Ninghao Liu, and Mengnan Du. 2025. https://doi.org/10.1609/aaai.v39i27.35038 Language ranker: A metric for quantifying llm performance across high and low-resource languages . Proceedings of the AAAI Conference on Artificial Intelligence, 39(27):28186–28194

  29. [29]

    Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin

    Patrick Littell, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. https://aclanthology.org/E17-2002/ URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors . In Proceedings of the 15th Conference of the E uropean Chapter of the Association for Computational Linguistics: Volume ...

  30. [30]

    Jessica Ojo, Odunayo Ogundepo, Akintunde Oladipo, Kelechi Ogueji, Jimmy Lin, Pontus Stenetorp, and David Ifeoluwa Adelani. 2025. https://doi.org/10.18653/v1/2025.findings-acl.976 A fro B ench: How good are large language models on A frican languages? In Findings of the Association for Computational Linguistics: ACL 2025, pages 19048--19095, Vienna, Austri...

  31. [31]

    Aleksandar Petrov, Emanuele La Malfa, Philip Torr, and Adel Bibi. 2023. https://openreview.net/forum?id=78yDLKi95p Language model tokenizers introduce unfairness between languages . In Thirty-seventh Conference on Neural Information Processing Systems

  32. [32]

    Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability . In Advances in Neural Information Processing Systems, volume 30. Curran ...

  33. [33]

    Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, and 38 others

    Angelika Romanou, Negar Foroutan, Anna Sotnikova, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Zeming Chen, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, and 38 others. 2025. https://openreview.net/f...

  34. [34]

    Phillip Rust, Jonas Pfeiffer, Ivan Vuli \'c , Sebastian Ruder, and Iryna Gurevych. 2021. https://doi.org/10.18653/v1/2021.acl-long.243 How good is your tokenizer? on the monolingual performance of multilingual language models . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confe...

  35. [35]

    Amartya Sen. 1976. https://doi.org/10.2307/2296597 Real national income . The Review of Economic Studies, 43(1):19--39

  36. [36]

    Shivalika Singh, Angelika Romanou, Cl \'e mentine Fourrier, David Ifeoluwa Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Sebastian Ruder, Wei-Yin Ko, Antoine Bosselut, Alice Oh, Andre Martins, Leshem Choshen, Daphne Ippolito, and 4 others. 2025. https://doi....

  37. [37]

    Yosephine Susanto, Adithya Venkatadri Hulagadri, Jann Railey Montalan, Jian Gang Ngui, Xianbin Yong, Wei Qi Leong, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Yifan Mai, and William Chandra Tjhi. 2025. https://doi.org/10.18653/v1/2025.findings-acl.636 SEA - HELM : S outheast A sian holistic evaluation of language models . In Findings of the Associati...

  38. [38]

    Dennis Ulmer, Elisa Bassignana, Max M \"u ller-Eberstein, Daniel Varab, Mike Zhang, Rob van der Goot, Christian Hardmeier, and Barbara Plank. 2022. https://doi.org/10.18653/v1/2022.findings-emnlp.196 Experimental standards for deep learning in natural language processing research . In Findings of the Association for Computational Linguistics: EMNLP 2022, ...

  39. [39]

    Sshubam Verma, Mohammed Safi Ur Rahman Khan, Vishwajeet Kumar, Rudra Murthy, and Jaydeep Sen. 2025. https://doi.org/10.18653/v1/2025.naacl-long.507 MILU : A multi-task I ndic language understanding benchmark . In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Techn...

  40. [40]

    Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West. 2024. https://doi.org/10.18653/v1/2024.acl-long.820 Do llamas work in E nglish? on the latent language of multilingual transformers . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15366--15394, Bangkok, Thailand....

  41. [41]

    Shijie Wu and Mark Dredze. 2020. https://doi.org/10.18653/v1/2020.repl4nlp-1.16 Are all languages created equal in multilingual BERT ? In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 120--130, Online. Association for Computational Linguistics

  42. [42]

    Weihao Xuan, Rui Yang, Heli Qi, Qingcheng Zeng, Yunze Xiao, Aosong Feng, Dairui Liu, Yun Xing, Junjue Wang, Fan Gao, Jinghui Lu, Yuang Jiang, Huitao Li, Xin Li, Kunyu Yu, Ruihai Dong, Shangding Gu, Yuekang Li, Xiaofei Xie, and 13 others. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.79 MMLU - P ro X : A multilingual benchmark for advanced large langua...

  43. [43]

    online" 'onlinestring :=

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

  44. [44]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...