pith. machine review for the scientific record.

arxiv: 2605.05902 · v1 · submitted 2026-05-07 · 💻 cs.SE

Recognition: unknown

Evaluating Non-English Developer Support in Machine Learning for Software Engineering

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 09:07 UTC · model grok-4.3

classification 💻 cs.SE
keywords non-English code comments · multilingual LLMs · code comment generation · evaluation metrics · LLM-as-a-judge · human annotation · software engineering

The pith

No automatic approach reliably evaluates non-English code comments from large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests five code LLMs on generating comments in Dutch, Greek, Polish, Chinese, and English. It creates a human-annotated dataset of 12,500 comments and a taxonomy of 26 error types through open coding. The results show that comment quality drops markedly outside English, with linguistic errors rising sharply, and that standard neural metrics and even LLM judges do not match human assessments consistently. This points to barriers in using current tools for multilingual software development, where code mixes with non-English text.
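
One concrete check implied by this setup is whether a generated comment is even written in the requested language. Below is a minimal sketch of such a check, assuming the off-the-shelf langdetect package and invented example comments; the paper's error taxonomy comes from human open coding, not from this heuristic.

```python
# Minimal sketch: flag generated comments whose detected language does not match
# the requested one. Illustrative only; the paper's error taxonomy comes from
# human open coding, not from automatic language identification.
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # langdetect is stochastic; fix the seed for repeatability

# Invented generated comments, keyed by the ISO code of the requested language.
generated = [
    ("nl", "Berekent de som van twee gehele getallen."),
    ("el", "Υπολογίζει το άθροισμα δύο ακεραίων."),
    ("pl", "Oblicza sumę dwóch liczb całkowitych."),
    ("zh-cn", "计算两个整数的和。"),
    ("nl", "Computes the sum of two integers."),  # English where Dutch was requested
]

for requested, comment in generated:
    try:
        detected = detect(comment)
    except LangDetectException:
        detected = "unknown"
    status = "ok" if detected == requested else "language mismatch"
    print(f"requested={requested:5s} detected={detected:7s} -> {status}")
```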

Core claim

Generative performance deteriorates substantially outside English, with linguistic errors increasing by up to 15.1 times, alongside more frequent incoherent generations and a rise in semantic errors. No automatic approach provides reliable and consistent assessment: neural metrics fail to distinguish correct comments from incorrect outputs or random noise and overestimate quality in non-English settings, while LLM-as-a-judge methods achieve the highest agreement with humans but miss important language-related and semantic errors. Human judgment remains indispensable for evaluating such outputs.
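
The noise-discrimination claim can be probed with a simple sanity check: a usable metric should score a semantically correct comment well above a fluent-but-wrong one, and far above random characters, given the same reference. A minimal sketch, assuming the open-source bert-score package and invented Dutch examples; the paper evaluates a much wider range of neural metrics.

```python
# Sanity check: does a neural metric separate a correct non-English comment from
# an incorrect one and from random noise, given the same reference? Illustrative
# sketch with bert-score; the paper covers many more metrics and languages.
import random
import string

from bert_score import score

reference = ["Berekent de som van twee gehele getallen."]     # human-written Dutch reference
correct   = ["Geeft de som van twee gehele getallen terug."]  # paraphrase, semantically correct
wrong     = ["Opent een netwerkverbinding met de server."]    # fluent Dutch, wrong meaning
noise     = ["".join(random.choices(string.ascii_lowercase + " ", k=40))]

for label, candidate in [("correct", correct), ("wrong", wrong), ("noise", noise)]:
    # a multilingual backbone is selected automatically for lang="nl"
    _, _, f1 = score(candidate, reference, lang="nl", verbose=False)
    print(f"{label:8s} BERTScore F1 = {f1.item():.3f}")

# If the three scores bunch together, the metric cannot be trusted to rank
# comment quality in this language -- the failure mode reported in the paper.
```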

What carries the argument

A taxonomy of 26 error types derived from open-coding 12,500 generated comments, paired with human annotations to benchmark overlap-based metrics, neural metrics, and LLM-as-a-judge pipelines for non-English code comment quality.
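
In outline, that benchmarking step scores each generated comment against a reference with an automatic metric and then asks how well the scores track the human verdicts. A minimal sketch, assuming sacrebleu's sentence-level chrF and a Spearman correlation over invented toy data; the paper's pipeline spans many metrics and the full 26-type annotation scheme.

```python
# Outline of the benchmarking step: score each (candidate, reference) pair with an
# overlap metric, then check how well the scores track human judgments.
# Toy data and chrF only; the paper benchmarks many metrics against 26 error types.
from sacrebleu.metrics import CHRF
from scipy.stats import spearmanr

chrf = CHRF()

# Invented items: generated comment, human reference, human verdict (1 = correct).
items = [
    ("Geeft de som van twee getallen terug.", "Berekent de som van twee gehele getallen.", 1),
    ("Opent een bestand in leesmodus.",       "Berekent de som van twee gehele getallen.", 0),
    ("Υπολογίζει το άθροισμα δύο αριθμών.",   "Υπολογίζει το άθροισμα δύο ακεραίων.",      1),
    ("asdf qwer zxcv",                        "Υπολογίζει το άθροισμα δύο ακεραίων.",      0),
]

scores = [chrf.sentence_score(cand, [ref]).score for cand, ref, _ in items]
labels = [label for _, _, label in items]

rho, p = spearmanr(scores, labels)
print(f"chrF scores: {[round(s, 1) for s in scores]}")
print(f"Spearman rho vs. human verdicts: {rho:.2f} (p={p:.2f})")
```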

If this is right

  • Generative performance drops substantially outside English, with linguistic errors up to 15.1 times higher.
  • Neural metrics cannot distinguish correct comments from incorrect ones or noise and overestimate non-English quality.
  • LLM-as-a-judge methods agree most closely with humans but still fail to capture language-related and semantic errors (see the judge-pipeline sketch after this list).
  • Evaluation and generation barriers persist for multilingual software engineering tooling.
  • Human judgment is required as automatic methods lack reliability.
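
For context on the judge-pipeline point above, the general shape of an LLM-as-a-judge step is to prompt a judge model for a structured verdict and parse it for comparison with human labels. The sketch below assumes a generic OpenAI-compatible chat endpoint; the prompt wording, model name, and error categories are illustrative, not the paper's.

```python
# Generic shape of an LLM-as-a-judge step: ask a judge model for a structured
# verdict on a generated comment, then parse it for comparison with human labels.
# Prompt wording, model name, and categories are illustrative, not the paper's.
import json

from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key in the environment

JUDGE_PROMPT = """You are reviewing a code comment written in {language}.
Code:
{code}
Generated comment:
{comment}
Answer in JSON with keys "correct" (true/false) and "errors"
(a subset of ["wrong_language", "incoherent", "semantic_error", "grammar_error"])."""

def judge(code: str, comment: str, language: str, model: str = "gpt-4o-mini") -> dict:
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            language=language, code=code, comment=comment)}],
    )
    # A robust pipeline would validate or repair the JSON before parsing.
    return json.loads(response.choices[0].message.content)

verdict = judge("def add(a, b):\n    return a + b",
                "Opent een netwerkverbinding.", "Dutch")
print(verdict)  # e.g. {"correct": false, "errors": ["semantic_error"]}
```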

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the evaluation to more languages or additional code tasks like bug fixing could reveal if the patterns hold more broadly.
  • Tool builders might need to incorporate language-specific modules or better multilingual pretraining to address the quality drop.
  • The emphasis on human annotation suggests that hybrid human-AI evaluation pipelines could be developed for practical use.
  • Similar issues may affect other non-English natural language elements in code, such as variable names or documentation.

Load-bearing premise

That the five languages and five models tested are representative of multilingual code comment generation in general, and that the 26 error types from the open-coding study cover the main quality problems.

What would settle it

A new automatic evaluation method that shows strong agreement with human judgments on a held-out set of non-English comments from additional languages or models, or a demonstration that neural metrics can reliably separate correct from random outputs in non-English settings.
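
One way to make "strong agreement" operational: fix a threshold in advance, turn the new metric's scores into accept/reject verdicts, and report chance-corrected agreement with human verdicts on held-out data, together with a separation check against random noise. A minimal sketch with invented numbers; acceptance criteria would have to be fixed in advance against the paper's released annotations.

```python
# What "settling it" could look like: chance-corrected agreement between a new
# metric's verdicts and human verdicts on held-out non-English comments, plus a
# separation check against random noise. Invented numbers throughout.
import numpy as np
from sklearn.metrics import cohen_kappa_score, roc_auc_score

# Held-out human verdicts (1 = correct comment) and the candidate metric's scores.
human  = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])
scores = np.array([0.91, 0.84, 0.35, 0.77, 0.52, 0.30, 0.88, 0.41, 0.79, 0.48])

threshold = 0.6                          # would need to be fixed on a separate dev set
verdicts = (scores >= threshold).astype(int)
kappa = cohen_kappa_score(human, verdicts)

# Separation check: scores of correct comments vs. scores of random-noise strings.
noise_scores = np.array([0.45, 0.50, 0.38, 0.55, 0.47])
correct_scores = scores[human == 1]
auc = roc_auc_score(
    np.concatenate([np.ones_like(correct_scores), np.zeros_like(noise_scores)]),
    np.concatenate([correct_scores, noise_scores]),
)

print(f"Cohen's kappa vs. human verdicts: {kappa:.2f}")
print(f"AUC, correct comments vs. random noise: {auc:.2f}")
```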

original abstract

Large Language Models are increasingly used in software engineering, but both code generation and its evaluation remain predominantly English-centric. This leaves a major gap in our understanding of how well current tools support multilingual development, where code contains non-English natural language. In this paper, we investigate non-English code comment generation and the reliability of current methods for evaluating such outputs. We evaluate five code LLMs (CodeGemma, CodeLlama, CodeQwen1.5, GraniteCode, and StarCoder2) across five natural languages: Dutch, English, Greek, Polish and Chinese. We further conduct an open-coding study of 12,500 generated comments, from which we derive a publicly released human-annotated dataset and a taxonomy of 26 error types. We use these human annotations, to evaluate the performance of neural metrics, and LLM-as-a-judge pipelines. Our findings show that generative performance deteriorates substantially outside English, with linguistic errors increasing by up to 15.1$\times$, alongside frequent incoherent generations and a rise in semantic errors. More critically, we show that detecting errors in non-English comments underperforms. Across classical overlap-based metrics, off-the-shelf neural metrics, extended neural metrics using newer multilingual, language-specific, and code-specific models, and LLM-as-a-judge pipelines, no automatic approach provides reliable and consistent assessment. Neural metrics fail to distinguish correct comments from incorrect outputs or even random noise, and tend to overestimate quality in non-English settings. LLM-as-a-judge methods achieve the highest agreement with human annotations but fail to reliably capture important language-related and semantic errors. Overall, our results show that evaluation and generation are key barriers for multilingual tooling, and that human judgment remains indispensable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. This paper investigates the challenges of non-English code comment generation using LLMs and evaluates the effectiveness of automatic metrics for assessing such outputs. It examines five code LLMs across five languages (Dutch, English, Greek, Polish, Chinese), performs open-coding on 12,500 generated comments to create a taxonomy of 26 error types and a public dataset, and compares classical overlap metrics, neural metrics (including multilingual and code-specific variants), and LLM-as-a-judge approaches against human annotations. The key findings are that generation quality declines markedly for non-English languages with increased linguistic and semantic errors, and that no automatic evaluation method reliably matches human judgments, with neural metrics particularly prone to overestimation and LLM judges missing key error types.

Significance. The results, if substantiated, are significant because they provide empirical evidence of the English-centric limitations in both generation and evaluation for ML-based software engineering tools. The public release of the annotated dataset and the error taxonomy represent valuable contributions that can facilitate future research in multilingual SE. This work highlights the indispensability of human judgment in this domain and could influence the development of more robust multilingual evaluation frameworks.

major comments (2)
  1. The derivation of the 26-error taxonomy via open-coding on the 12,500 comments is central to the evaluation; however, the paper lacks reporting of inter-annotator agreement metrics (such as Fleiss' kappa) and the process for resolving disagreements, which undermines confidence in the ground truth labels used to assess all automatic methods (see the agreement sketch after this list).
  2. The claim that no automatic approach provides reliable assessment across non-English settings is based solely on the five selected languages and models; without additional experiments or discussion on languages with different characteristics (e.g., right-to-left scripts or agglutinative languages), the broad conclusion risks overstating the results' applicability.
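
For reference, the agreement statistic requested in the first major comment can be computed directly from an items-by-raters label matrix. A minimal sketch using statsmodels with an invented three-annotator example; the paper's actual annotation protocol and category set are not specified here.

```python
# Inter-annotator agreement as requested above: Fleiss' kappa over an invented
# items x raters matrix of error labels, using the statsmodels implementation.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = annotated comments, columns = three annotators.
# Labels: 0 = correct, 1 = linguistic error, 2 = semantic error (illustrative categories).
labels = np.array([
    [0, 0, 0],
    [1, 1, 2],
    [2, 2, 2],
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
])

# aggregate_raters turns the raters-per-item matrix into category counts per item.
counts, _ = aggregate_raters(labels)
print(f"Fleiss' kappa = {fleiss_kappa(counts):.3f}")
```
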
minor comments (2)
  1. The abstract mentions 'linguistic errors increasing by up to 15.1×' but does not specify the baseline (English) or the exact metric used for this multiplier, which could be clarified for precision.
  2. Some figures comparing metric scores across languages could benefit from error bars or statistical significance tests to strengthen the visual claims (see the bootstrap sketch below).
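
The error bars requested in the second minor comment can be obtained by bootstrapping per-language mean scores. A minimal sketch over invented per-comment scores; the real analysis would resample the paper's own data.

```python
# Bootstrap confidence intervals for per-language mean metric scores, as the minor
# comment suggests. Invented per-comment scores; the real analysis would resample
# the paper's own data.
import numpy as np

rng = np.random.default_rng(0)
scores_by_language = {
    "English": rng.normal(0.82, 0.08, size=200).clip(0, 1),
    "Dutch":   rng.normal(0.65, 0.12, size=200).clip(0, 1),
    "Greek":   rng.normal(0.58, 0.15, size=200).clip(0, 1),
}

def bootstrap_ci(values: np.ndarray, n_resamples: int = 10_000, alpha: float = 0.05):
    """Percentile bootstrap CI for the mean of `values`."""
    means = np.array([rng.choice(values, size=len(values), replace=True).mean()
                      for _ in range(n_resamples)])
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

for language, values in scores_by_language.items():
    low, high = bootstrap_ci(values)
    print(f"{language:8s} mean={values.mean():.3f}  95% CI=[{low:.3f}, {high:.3f}]")
```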

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and specify the changes we will make to the manuscript.

point-by-point responses
  1. Referee: The derivation of the 26-error taxonomy via open-coding on the 12,500 comments is central to the evaluation; however, the paper lacks reporting of inter-annotator agreement metrics (such as Fleiss' kappa) and the process for resolving disagreements, which undermines confidence in the ground truth labels used to assess all automatic methods.

    Authors: We agree that explicit reporting of inter-annotator agreement and disagreement resolution is necessary to substantiate the reliability of the taxonomy and human annotations. The open-coding process involved multiple annotators, and we will add a dedicated subsection describing the full annotation protocol, including Fleiss' kappa values and the consensus procedure used to resolve disagreements. These details will be included in the revised manuscript to strengthen confidence in the ground-truth labels. revision: yes

  2. Referee: The claim that no automatic approach provides reliable assessment across non-English settings is based solely on the five selected languages and models; without additional experiments or discussion on languages with different characteristics (e.g., right-to-left scripts or agglutinative languages), the broad conclusion risks overstating the results' applicability.

    Authors: We appreciate this point on the scope of our language selection. The five languages span multiple families and scripts, but we recognize they do not exhaustively cover all linguistic phenomena such as right-to-left scripts or agglutinative structures. We will add an expanded limitations paragraph that explicitly discusses the generalizability constraints of our findings and outlines opportunities for future work on additional language types, thereby qualifying the breadth of our conclusions without requiring new experiments. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation study

full rationale

The paper performs an empirical evaluation: it generates comments with five LLMs across five languages, conducts open-coding on 12,500 outputs to produce a 26-error taxonomy and human annotations, then directly compares automatic metrics (overlap, neural, LLM-as-judge) against those annotations. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the claimed results. All findings are observational comparisons on the collected data, with no reduction of outputs to inputs by construction. The study is self-contained against its own human labels.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study is empirical and relies on standard assumptions in software engineering research rather than new theoretical constructs.

axioms (1)
  • domain assumption: Human annotations obtained via open coding provide a reliable ground truth for comment quality and error classification.
    The paper uses these annotations to benchmark all automatic metrics and to derive the error taxonomy.

pith-pipeline@v0.9.0 · 5642 in / 1282 out tokens · 76496 ms · 2026-05-08T09:07:42.866932+00:00 · methodology

