YOMI-Bench: A Benchmark for Evaluating Kanji Reading and Phonological Understanding of LLMs for Japanese

Hiroya Takamura; Hitomi Yanaka; Ryota Mibayashi

arxiv: 2607.00664 · v1 · pith:6WBPSGHXnew · submitted 2026-07-01 · 💻 cs.CL

YOMI-Bench: A Benchmark for Evaluating Kanji Reading and Phonological Understanding of LLMs for Japanese

Ryota Mibayashi , Hiroya Takamura , Hitomi Yanaka This is my paper

Pith reviewed 2026-07-02 13:01 UTC · model grok-4.3

classification 💻 cs.CL

keywords kanji readingJapanese LLMsbenchmarkphonological understandingLLM evaluationmultiple readingsgeneration tasksYOMI-Bench

0 comments

The pith

Even Japanese-specific LLMs perform poorly on kanji reading tasks according to a new benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Kanji characters in Japanese often have multiple possible readings, so surface text alone does not determine the correct pronunciation. The paper introduces YOMI-Bench, a set of four tasks built to test LLMs on kanji reading and phonological understanding. Evaluations across one multilingual open model, four Japanese-specific open models, and five commercial models show low performance overall. Japanese-specific models still struggle, and commercial models fare especially poorly on generation tasks that require selecting the right reading. A reader would care because this pinpoints a concrete limitation in applying current LLMs to Japanese text production and analysis.

Core claim

The paper proposes YOMI-Bench consisting of four tasks to evaluate kanji reading performance in Japanese. Through systematic testing, it establishes that even Japanese-specific open LLMs exhibit low performance, while commercial LLMs also perform poorly on generation tasks that require consideration of kanji readings.

What carries the argument

YOMI-Bench, a benchmark of four tasks designed to measure LLMs' ability to handle multiple possible readings for individual kanji characters.

If this is right

Even Japanese-specific open LLMs exhibit low performance on the benchmark tasks.
Commercial LLMs perform poorly on generation tasks that require consideration of kanji readings.
Multilingual open LLMs also show low performance due to the linguistic characteristic of multiple kanji readings.
The benchmark reveals that inferring correct readings from surface-level text alone remains difficult for current models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training data that explicitly covers phonological variations across kanji readings could raise scores on these tasks.
The same task design could be adapted to test reading disambiguation in other languages with character-based scripts.
Low benchmark results suggest that Japanese text generation systems may produce incorrect readings until this gap is closed.
Model developers could add explicit disambiguation layers focused on Japanese phonology to improve reliability.

Load-bearing premise

The four tasks in YOMI-Bench are valid and unbiased measures of kanji reading ability that reflect real-world difficulties rather than artifacts of task design.

What would settle it

A model achieving consistently high accuracy, such as above 80 percent, across all four tasks while retaining general language capabilities would falsify the reported low performance.

Figures

Figures reproduced from arXiv: 2607.00664 by Hiroya Takamura, Hitomi Yanaka, Ryota Mibayashi.

**Figure 1.** Figure 1: The overview of YOMI-Bench. the kanji character “覚” has three possible readings: “kaku,” “obo,” and “sa.” These readings vary depending on the word in which the character appears, such as “覚醒 (kakusei/Awakening),” “覚える (oboeru/Memorize),” and “覚める (sameru/Wake up).” Thus, in order to correctly predict the readings of kanji characters appearing within individual words, models should not only possess the k… view at source ↗

read the original abstract

We propose YOMI-Bench, a benchmark for evaluating kanji reading and phonological understanding of large language models (LLMs) for Japanese. In Japanese, a single kanji character often has multiple possible readings, making it difficult to infer the correct reading from surface-level text alone. Due to these linguistic characteristics, it is empirically known that LLMs exhibit low performance in kanji reading for Japanese. The proposed YOMI-Bench consists of four tasks specifically designed to evaluate kanji reading performance in Japanese. In our evaluation using YOMI-Bench, we assessed one multilingual open LLM, four Japanese-specific open LLMs, and five commercial LLMs. As a result, we found that even Japanese-specific models show low performance, and that commercial models also perform poorly on generation tasks that require consideration of kanji readings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

YOMI-Bench adds a targeted kanji-reading eval that current models fail, but the abstract gives almost no task details so the results are hard to trust yet.

read the letter

The paper introduces YOMI-Bench with four tasks meant to probe how LLMs handle the multiple possible readings of kanji in Japanese. They run it on one multilingual open model, four Japanese-specific open models, and five commercial ones, and report low scores overall, especially on generation. That matches the known issue that surface text alone often does not determine the right pronunciation.

What the work does is fill a narrow but real gap: no prior benchmark focused specifically on this phonological ambiguity for Japanese LLMs. Including both open and closed models is straightforward and useful for comparison.

The main weakness is that the abstract supplies no information on how the four tasks were built, what sources the items come from, how difficulty was controlled, or what the actual scores and error patterns look like. Without those pieces it is difficult to know whether the low performance reflects genuine model limits or quirks in the test design. The claim that even Japanese-specific models struggle is plausible but rests on unshown evidence.

This paper is mainly for people who evaluate or fine-tune Japanese or multilingual LLMs and want a dedicated check on kanji phonology. A reader already working on non-English script handling would find the task list worth looking at once the construction details appear.

It deserves a serious referee because a clean benchmark in this area would be worth having if the tasks hold up under scrutiny. I would send it out for review and ask the authors to add concrete examples, data sources, and basic statistics in the next version.

Referee Report

1 major / 1 minor

Summary. The paper proposes YOMI-Bench, a benchmark consisting of four tasks to evaluate the kanji reading and phonological understanding of LLMs for Japanese. It evaluates one multilingual open LLM, four Japanese-specific open LLMs, and five commercial LLMs, concluding that even Japanese-specific models exhibit low performance and that commercial models perform poorly on generation tasks requiring consideration of kanji readings.

Significance. If the benchmark tasks are shown to be valid and unbiased measures of kanji reading ability, this work could highlight important limitations in current LLMs for handling Japanese language specifics, which is relevant for improving multilingual and language-specific models. The evaluation across different model types provides a broad view of the issue.

major comments (1)

[Abstract] Abstract: The abstract provides no details on task construction, data sources, metrics, or statistical significance, making it impossible to verify whether the reported low performance supports the claims about model capabilities.

minor comments (1)

The paper could benefit from including example instances from the four tasks to illustrate the evaluation.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their detailed review and constructive comments on our manuscript. We address the major comment point-by-point below.

read point-by-point responses

Referee: [Abstract] Abstract: The abstract provides no details on task construction, data sources, metrics, or statistical significance, making it impossible to verify whether the reported low performance supports the claims about model capabilities.

Authors: We agree that the current abstract is high-level and omits key details on task construction, data sources, and metrics, which are instead described in Sections 3 (Benchmark Construction) and 4 (Experiments). This is a valid observation. To address it, we will revise the abstract to concisely summarize the four tasks, their data sources (e.g., existing Japanese corpora and manually curated examples), the primary metrics (accuracy and F1 for reading prediction/generation), and note that evaluations are deterministic with no statistical significance testing applied due to the fixed benchmark design. This change will allow readers to better assess the claims directly from the abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper proposes YOMI-Bench as a new benchmark with four tasks for evaluating kanji reading in LLMs and reports direct evaluation results on open and commercial models. No derivations, equations, fitted parameters, predictions, or uniqueness theorems are present. The claims rest on empirical performance numbers from the benchmark application itself, with no self-referential reductions or load-bearing self-citations that collapse the argument. This is a standard benchmark paper whose central content is self-contained against external model evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Benchmark proposal paper with no mathematical derivations, fitted parameters, or postulated entities.

pith-pipeline@v0.9.1-grok · 5681 in / 942 out tokens · 22224 ms · 2026-07-02T13:01:59.164136+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 20 canonical work pages · 2 internal anchors

[1]

Neural substrates of phonological selection for Japanese character Kanji based on fMRI investigations

Kayako Matsuo and Shen-Hsing Annabel Chen and Chih-Wei Hue and Chiao-Yi Wu and Epifanio Bagarinao and Wen-Yih Isaac Tseng and Toshiharu Nakai. Neural substrates of phonological selection for Japanese character Kanji based on fMRI investigations. NeuroImage. 2010. doi:10.1016/j.neuroimage.2009.12.099

work page doi:10.1016/j.neuroimage.2009.12.099 2010
[2]

Historical Analysis of Japanese Writing Systems Hiragana, Katakana, and Kanji

Yessy Harun and Febi Nur Biduri. Historical Analysis of Japanese Writing Systems Hiragana, Katakana, and Kanji. International Journal of Social Service and Research. 2024

2024
[4]

arXiv preprint arXiv:2412.14471

Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs , author=. arXiv preprint arXiv:2412.14471. 2025. doi:10.48550/arXiv.2402.01349

work page doi:10.48550/arxiv.2402.01349 2025
[5]

What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations

Katharina Trinley and Toshiki Nakai and Tatiana Anikina and Tanja Baeumel. What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations. arXiv preprint arXiv:2507.20279. 2025

work page arXiv 2025
[6]

Benchmax: A comprehensive multilingual evaluation suite for large language models

Xu Huang and Wenhao Zhu and Hanxu Hu and Conghui He and Lei Li and Shujian Huang and Fei Yuan. BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models. arXiv preprint arXiv:2502.07346. 2025

work page arXiv 2025
[7]

The NazoNazo Benchmark: A Cost-Effective and Extensible Test of Insight-Based Reasoning in LLMs

Masaharu Mizumoto and Dat Nguyen and Zhiheng Han and Jiyuan Fang and Heyuan Guan and Xingfu Li and Naoya Shiraishi and Xuyang Tian and Yo Nakawake and Le Minh Nguyen. The NazoNazo Benchmark: A Cost-Effective and Extensible Test of Insight-Based Reasoning in LLMs. arXiv preprint arXiv:2509.14704. 2025

work page arXiv 2025
[8]

Multilingual Large Language Models: A Systematic Survey

Shaolin Zhu and Supryadi and Shaoyang Xu and Haoran Sun and Leiyu Pan and Menglong Cui and Jiangcun Du and Renren Jin and António Branco and Deyi Xiong. Multilingual Large Language Models: A Systematic Survey. arXiv preprint arXiv:2411.11072. 2024. doi:10.48550/arXiv.2411.11072

work page doi:10.48550/arxiv.2411.11072 2024
[9]

A survey on large language model benchmarks, 2025

Shiwen Ni and Guhong Chen and Shuaimin Li and Xuanang Chen and Siyi Li and Bingli Wang and Qiyao Wang and Xingjian Wang and Yifan Zhang and Liyang Fan and Chengming Li and Ruifeng Xu and Le Sun and Min Yang. A Survey on Large Language Model Benchmarks. arXiv preprint arXiv:2508.15361. 2025

work page arXiv 2025
[10]

Japanese Rhyme Generation Based on Mora Similarity and Generation Probability

Mibayashi, Ryota and Yamamoto, Takehiro and Ohshima, Hiroaki. Japanese Rhyme Generation Based on Mora Similarity and Generation Probability. Proceedings of the 27th International Conference on Information Integration and Web Intelligence. 2025. doi:10.1007/978-3-032-11976-6_7

work page doi:10.1007/978-3-032-11976-6_7 2025
[11]

and Kann, Katharina and Mielke, Sabrina J

Cotterell, Ryan and Kirov, Christo and Sylak-Glassman, John and Walther, G \'e raldine and Vylomova, Ekaterina and McCarthy, Arya D. and Kann, Katharina and Mielke, Sabrina J. and Nicolai, Garrett and Silfverberg, Miikka and Yarowsky, David and Eisner, Jason and Hulden, Mans. The C o NLL -- SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection...

work page doi:10.18653/v1/k18-3001 2018
[12]

Barrault, Lo \"i c and Biesialska, Magdalena and Bojar, Ond r ej and Costa-juss \`a , Marta R. and Federmann, Christian and Graham, Yvette and Grundkiewicz, Roman and Haddow, Barry and Huck, Matthias and Joanis, Eric and Kocmi, Tom and Koehn, Philipp and Lo, Chi-kiu and Ljube s i \'c , Nikola and Monz, Christof and Morishita, Makoto and Nagata, Masaaki an...

2020
[13]

Training Verifiers to Solve Math Word Problems

Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168. 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[14]

Saiful and Mubasshir, Kazi and Li, Yuan-Fang and Kang, Yong-Bin and Rahman, M

Hasan, Tahmid and Bhattacharjee, Abhik and Islam, Md. Saiful and Mubasshir, Kazi and Li, Yuan-Fang and Kang, Yong-Bin and Rahman, M. Sohel and Shahriyar, Rifat. XL -Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages. Proceedings of Findings of the Association for Computational Linguistics. 2021

2021
[15]

Development of a question answering system focused on an encyclopedia

Sekine, Satoshi. Development of a question answering system focused on an encyclopedia. Proceedings of the 9th Annual Meeting of the Association for Natural Language Processing. 2003

2003
[16]

JEMH op QA : Dataset for J apanese Explainable Multi-Hop Question Answering

Ishii, Ai and Inoue, Naoya and Suzuki, Hisami and Sekine, Satoshi. JEMH op QA : Dataset for J apanese Explainable Multi-Hop Question Answering. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation. 2024

2024
[17]

R oman L ens: The Role Of Latent R omanization In Multilinguality In LLM s

Saji, Alan and Husain, Jaavid Aktar and Jayakumar, Thanmay and Dabre, Raj and Kunchukuttan, Anoop and Puduppully, Ratish. R oman L ens: The Role Of Latent R omanization In Multilinguality In LLM s. Findings of the Association for Computational Linguistics. 2025

2025
[18]

Large language models are not robust multiple choice selectors, 2024

Zheng, Chujie and Zhou, Hao and Meng, Fandong and Zhou, Jie and Huang, Minlie. Large Language Models Are Not Robust Multiple Choice Selectors. International Conference on Representation Learning. 2024. doi:10.48550/arXiv.2309.03882

work page doi:10.48550/arxiv.2309.03882 2024
[19]

Measuring Massive Multitask Language Understanding

Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt. Measuring Massive Multitask Language Understanding. International Conference on Learning Representations. 2021

2021
[20]

and Zhang, Hao and Gonzalez, Joseph E

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. Proceedings of the 37th International Conference on Neural Information Proce...

2023
[21]

Large Language Models Lack Understanding of Character Composition of Words

Andrew Shin and Kunitake Kaneko. Large Language Models Lack Understanding of Character Composition of Words. ICML Workshop on Large Language Models and Cognition. 2024

2024
[22]

David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman. GPQA : A Graduate-Level Google-Proof Q&A Benchmark. Proceedings of the First Conference on Language Modeling. 2024

2024
[23]

Large language models are zero-shot reasoners

Kojima, Takeshi and Gu, Shixiang Shane and Reid, Machel and Matsuo, Yutaka and Iwasawa, Yusuke. Large language models are zero-shot reasoners. Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022

2022
[24]

Exploring the Potential of Prompt-Based Method for Kanji-Kana Conversion in Japanese Braille Translation

Micah Kitsunai, Deborah Watty, Shu-Kai Hsieh. Exploring the Potential of Prompt-Based Method for Kanji-Kana Conversion in Japanese Braille Translation. In the 29th Annual Meeting of Japanese Association for Natural Language Processing. 2024

2024
[25]

JGLUE : J apanese General Language Understanding Evaluation

Kurihara, Kentaro and Kawahara, Daisuke and Shibata, Tomohide. JGLUE : J apanese General Language Understanding Evaluation. Proceedings of the Thirteenth Language Resources and Evaluation Conference. 2022

2022
[26]

Should We Respect LLM s? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance

Yin, Ziqi and Wang, Hao and Horio, Kaito and Kawahara, Daisuke and Sekine, Satoshi. Should We Respect LLM s? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance. Proceedings of the Second Workshop on Social Influence in Conversations (SICon 2024). 2024. doi:10.18653/v1/2024.sicon-1.2

work page doi:10.18653/v1/2024.sicon-1.2 2024
[27]

ByT5: Towards a token-free future with pre-trained byte-to-byte models , journal =

Linting Xue and Aditya Barua and Noah Constant and Rami Al. ByT5: Towards a token-free future with pre-trained byte-to-byte models , journal =. 2021 , url =. 2105.13626 , timestamp =

work page arXiv 2021
[28]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
[29]

Dan Gusfield , title =. 1997

1997
[30]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015
[31]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
[32]

Evaluating Large Language Models Trained on Code

Evaluating Large Language Models Trained on Code , author=. arXiv preprint arXiv:2107.03374 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[33]

DeepRapper : N eural Rap Generation with Rhyme and Rhythm Modeling

Liqiang Xue and Kaitao Song and Di Wu and Xu Tan and Ningyu Zhang and Tao Qin and Wentao Zhang and Tie-Yan Liu. DeepRapper : N eural Rap Generation with Rhyme and Rhythm Modeling. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. 2021. doi:10....

work page doi:10.18653/v1/2021.acl-long.6 2021
[34]

and Ba, J

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representation. 2015

2015
[35]

and Addanki, K

Wu, D. and Addanki, K. Learning to Rap Battle with Bilingual Recursive Neural Networks. Proceedings of the 24th International Joint Conference on Artificial Intelligence. 2015

2015
[36]

Ghost W riter: U sing an LSTM for Automatic Rap Lyric Generation

Peter Potash and Alexey Romanov and Anna Rumshisky. Ghost W riter: U sing an LSTM for Automatic Rap Lyric Generation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015. doi:https://doi.org/10.18653/v1/D15-1221

work page doi:10.18653/v1/d15-1221 2015
[37]

PLATO-Ad: A Unified Advertisement Text Generation Framework with Multi-Task Prompt Learning

Zeyang Lei and Chao Zhang and Xinchao Xu and Wenquan Wu and Zheng-yu Niu and Hua Wu and Haifeng Wang and Yi Yang and Shuanglong Li. PLATO-Ad: A Unified Advertisement Text Generation Framework with Multi-Task Prompt Learning. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2022. doi:10.18653/v1/2022.e...

work page doi:10.18653/v1/2022.emnlp-industry.52 2022
[38]

DopeLearning: A Computational Approach to Rap Lyrics Generation

Eetu Malmi and Pyry Takala and Hannu Toivonen and Tapani Raiko and Aristides Gionis. DopeLearning: A Computational Approach to Rap Lyrics Generation. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. doi:10.1145/2939672.2939841

work page doi:10.1145/2939672.2939841 2016
[39]

Verse Generation by Reverse Generation Considering Rhyme and Answer in Japanese Rap Battles

Mibayashi, Ryota and Yamamoto, Takehiro and Tsukuda, Kosetsu and Watanabe, Kento and Nakano, Tomoyasu and Goto, Masataka and Ohshima, Hiroaki. Verse Generation by Reverse Generation Considering Rhyme and Answer in Japanese Rap Battles. Proceedings of the 16th International Symposium on Computer Music Multidisciplinary Research. 2023. doi:10.5281/zenodo.10109961

work page doi:10.5281/zenodo.10109961 2023
[40]

Nikolov and Eetu Malmi and Curtis G

Nikola I. Nikolov and Eetu Malmi and Curtis G. Northcutt and Lorenzo Parisi. Rapformer: Conditional Rap Lyrics Generation with Denoising Autoencoders. Proceedings of the 13th International Conference on Natural Language Generation. 2020. doi:10.18653/v1/2020.inlg-1.42

work page doi:10.18653/v1/2020.inlg-1.42 2020

[1] [1]

Neural substrates of phonological selection for Japanese character Kanji based on fMRI investigations

Kayako Matsuo and Shen-Hsing Annabel Chen and Chih-Wei Hue and Chiao-Yi Wu and Epifanio Bagarinao and Wen-Yih Isaac Tseng and Toshiharu Nakai. Neural substrates of phonological selection for Japanese character Kanji based on fMRI investigations. NeuroImage. 2010. doi:10.1016/j.neuroimage.2009.12.099

work page doi:10.1016/j.neuroimage.2009.12.099 2010

[2] [2]

Historical Analysis of Japanese Writing Systems Hiragana, Katakana, and Kanji

Yessy Harun and Febi Nur Biduri. Historical Analysis of Japanese Writing Systems Hiragana, Katakana, and Kanji. International Journal of Social Service and Research. 2024

2024

[3] [4]

arXiv preprint arXiv:2412.14471

Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs , author=. arXiv preprint arXiv:2412.14471. 2025. doi:10.48550/arXiv.2402.01349

work page doi:10.48550/arxiv.2402.01349 2025

[4] [5]

What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations

Katharina Trinley and Toshiki Nakai and Tatiana Anikina and Tanja Baeumel. What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations. arXiv preprint arXiv:2507.20279. 2025

work page arXiv 2025

[5] [6]

Benchmax: A comprehensive multilingual evaluation suite for large language models

Xu Huang and Wenhao Zhu and Hanxu Hu and Conghui He and Lei Li and Shujian Huang and Fei Yuan. BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models. arXiv preprint arXiv:2502.07346. 2025

work page arXiv 2025

[6] [7]

The NazoNazo Benchmark: A Cost-Effective and Extensible Test of Insight-Based Reasoning in LLMs

Masaharu Mizumoto and Dat Nguyen and Zhiheng Han and Jiyuan Fang and Heyuan Guan and Xingfu Li and Naoya Shiraishi and Xuyang Tian and Yo Nakawake and Le Minh Nguyen. The NazoNazo Benchmark: A Cost-Effective and Extensible Test of Insight-Based Reasoning in LLMs. arXiv preprint arXiv:2509.14704. 2025

work page arXiv 2025

[7] [8]

Multilingual Large Language Models: A Systematic Survey

Shaolin Zhu and Supryadi and Shaoyang Xu and Haoran Sun and Leiyu Pan and Menglong Cui and Jiangcun Du and Renren Jin and António Branco and Deyi Xiong. Multilingual Large Language Models: A Systematic Survey. arXiv preprint arXiv:2411.11072. 2024. doi:10.48550/arXiv.2411.11072

work page doi:10.48550/arxiv.2411.11072 2024

[8] [9]

A survey on large language model benchmarks, 2025

Shiwen Ni and Guhong Chen and Shuaimin Li and Xuanang Chen and Siyi Li and Bingli Wang and Qiyao Wang and Xingjian Wang and Yifan Zhang and Liyang Fan and Chengming Li and Ruifeng Xu and Le Sun and Min Yang. A Survey on Large Language Model Benchmarks. arXiv preprint arXiv:2508.15361. 2025

work page arXiv 2025

[9] [10]

Japanese Rhyme Generation Based on Mora Similarity and Generation Probability

Mibayashi, Ryota and Yamamoto, Takehiro and Ohshima, Hiroaki. Japanese Rhyme Generation Based on Mora Similarity and Generation Probability. Proceedings of the 27th International Conference on Information Integration and Web Intelligence. 2025. doi:10.1007/978-3-032-11976-6_7

work page doi:10.1007/978-3-032-11976-6_7 2025

[10] [11]

and Kann, Katharina and Mielke, Sabrina J

Cotterell, Ryan and Kirov, Christo and Sylak-Glassman, John and Walther, G \'e raldine and Vylomova, Ekaterina and McCarthy, Arya D. and Kann, Katharina and Mielke, Sabrina J. and Nicolai, Garrett and Silfverberg, Miikka and Yarowsky, David and Eisner, Jason and Hulden, Mans. The C o NLL -- SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection...

work page doi:10.18653/v1/k18-3001 2018

[11] [12]

Barrault, Lo \"i c and Biesialska, Magdalena and Bojar, Ond r ej and Costa-juss \`a , Marta R. and Federmann, Christian and Graham, Yvette and Grundkiewicz, Roman and Haddow, Barry and Huck, Matthias and Joanis, Eric and Kocmi, Tom and Koehn, Philipp and Lo, Chi-kiu and Ljube s i \'c , Nikola and Monz, Christof and Morishita, Makoto and Nagata, Masaaki an...

2020

[12] [13]

Training Verifiers to Solve Math Word Problems

Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168. 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[13] [14]

Saiful and Mubasshir, Kazi and Li, Yuan-Fang and Kang, Yong-Bin and Rahman, M

Hasan, Tahmid and Bhattacharjee, Abhik and Islam, Md. Saiful and Mubasshir, Kazi and Li, Yuan-Fang and Kang, Yong-Bin and Rahman, M. Sohel and Shahriyar, Rifat. XL -Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages. Proceedings of Findings of the Association for Computational Linguistics. 2021

2021

[14] [15]

Development of a question answering system focused on an encyclopedia

Sekine, Satoshi. Development of a question answering system focused on an encyclopedia. Proceedings of the 9th Annual Meeting of the Association for Natural Language Processing. 2003

2003

[15] [16]

JEMH op QA : Dataset for J apanese Explainable Multi-Hop Question Answering

Ishii, Ai and Inoue, Naoya and Suzuki, Hisami and Sekine, Satoshi. JEMH op QA : Dataset for J apanese Explainable Multi-Hop Question Answering. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation. 2024

2024

[16] [17]

R oman L ens: The Role Of Latent R omanization In Multilinguality In LLM s

Saji, Alan and Husain, Jaavid Aktar and Jayakumar, Thanmay and Dabre, Raj and Kunchukuttan, Anoop and Puduppully, Ratish. R oman L ens: The Role Of Latent R omanization In Multilinguality In LLM s. Findings of the Association for Computational Linguistics. 2025

2025

[17] [18]

Large language models are not robust multiple choice selectors, 2024

Zheng, Chujie and Zhou, Hao and Meng, Fandong and Zhou, Jie and Huang, Minlie. Large Language Models Are Not Robust Multiple Choice Selectors. International Conference on Representation Learning. 2024. doi:10.48550/arXiv.2309.03882

work page doi:10.48550/arxiv.2309.03882 2024

[18] [19]

Measuring Massive Multitask Language Understanding

Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt. Measuring Massive Multitask Language Understanding. International Conference on Learning Representations. 2021

2021

[19] [20]

and Zhang, Hao and Gonzalez, Joseph E

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. Proceedings of the 37th International Conference on Neural Information Proce...

2023

[20] [21]

Large Language Models Lack Understanding of Character Composition of Words

Andrew Shin and Kunitake Kaneko. Large Language Models Lack Understanding of Character Composition of Words. ICML Workshop on Large Language Models and Cognition. 2024

2024

[21] [22]

David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman. GPQA : A Graduate-Level Google-Proof Q&A Benchmark. Proceedings of the First Conference on Language Modeling. 2024

2024

[22] [23]

Large language models are zero-shot reasoners

Kojima, Takeshi and Gu, Shixiang Shane and Reid, Machel and Matsuo, Yutaka and Iwasawa, Yusuke. Large language models are zero-shot reasoners. Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022

2022

[23] [24]

Exploring the Potential of Prompt-Based Method for Kanji-Kana Conversion in Japanese Braille Translation

Micah Kitsunai, Deborah Watty, Shu-Kai Hsieh. Exploring the Potential of Prompt-Based Method for Kanji-Kana Conversion in Japanese Braille Translation. In the 29th Annual Meeting of Japanese Association for Natural Language Processing. 2024

2024

[24] [25]

JGLUE : J apanese General Language Understanding Evaluation

Kurihara, Kentaro and Kawahara, Daisuke and Shibata, Tomohide. JGLUE : J apanese General Language Understanding Evaluation. Proceedings of the Thirteenth Language Resources and Evaluation Conference. 2022

2022

[25] [26]

Should We Respect LLM s? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance

Yin, Ziqi and Wang, Hao and Horio, Kaito and Kawahara, Daisuke and Sekine, Satoshi. Should We Respect LLM s? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance. Proceedings of the Second Workshop on Social Influence in Conversations (SICon 2024). 2024. doi:10.18653/v1/2024.sicon-1.2

work page doi:10.18653/v1/2024.sicon-1.2 2024

[26] [27]

ByT5: Towards a token-free future with pre-trained byte-to-byte models , journal =

Linting Xue and Aditya Barua and Noah Constant and Rami Al. ByT5: Towards a token-free future with pre-trained byte-to-byte models , journal =. 2021 , url =. 2105.13626 , timestamp =

work page arXiv 2021

[27] [28]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

[28] [29]

Dan Gusfield , title =. 1997

1997

[29] [30]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015

[30] [31]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

[31] [32]

Evaluating Large Language Models Trained on Code

Evaluating Large Language Models Trained on Code , author=. arXiv preprint arXiv:2107.03374 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[32] [33]

DeepRapper : N eural Rap Generation with Rhyme and Rhythm Modeling

Liqiang Xue and Kaitao Song and Di Wu and Xu Tan and Ningyu Zhang and Tao Qin and Wentao Zhang and Tie-Yan Liu. DeepRapper : N eural Rap Generation with Rhyme and Rhythm Modeling. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. 2021. doi:10....

work page doi:10.18653/v1/2021.acl-long.6 2021

[33] [34]

and Ba, J

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representation. 2015

2015

[34] [35]

and Addanki, K

Wu, D. and Addanki, K. Learning to Rap Battle with Bilingual Recursive Neural Networks. Proceedings of the 24th International Joint Conference on Artificial Intelligence. 2015

2015

[35] [36]

Ghost W riter: U sing an LSTM for Automatic Rap Lyric Generation

Peter Potash and Alexey Romanov and Anna Rumshisky. Ghost W riter: U sing an LSTM for Automatic Rap Lyric Generation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015. doi:https://doi.org/10.18653/v1/D15-1221

work page doi:10.18653/v1/d15-1221 2015

[36] [37]

PLATO-Ad: A Unified Advertisement Text Generation Framework with Multi-Task Prompt Learning

Zeyang Lei and Chao Zhang and Xinchao Xu and Wenquan Wu and Zheng-yu Niu and Hua Wu and Haifeng Wang and Yi Yang and Shuanglong Li. PLATO-Ad: A Unified Advertisement Text Generation Framework with Multi-Task Prompt Learning. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2022. doi:10.18653/v1/2022.e...

work page doi:10.18653/v1/2022.emnlp-industry.52 2022

[37] [38]

DopeLearning: A Computational Approach to Rap Lyrics Generation

Eetu Malmi and Pyry Takala and Hannu Toivonen and Tapani Raiko and Aristides Gionis. DopeLearning: A Computational Approach to Rap Lyrics Generation. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. doi:10.1145/2939672.2939841

work page doi:10.1145/2939672.2939841 2016

[38] [39]

Verse Generation by Reverse Generation Considering Rhyme and Answer in Japanese Rap Battles

Mibayashi, Ryota and Yamamoto, Takehiro and Tsukuda, Kosetsu and Watanabe, Kento and Nakano, Tomoyasu and Goto, Masataka and Ohshima, Hiroaki. Verse Generation by Reverse Generation Considering Rhyme and Answer in Japanese Rap Battles. Proceedings of the 16th International Symposium on Computer Music Multidisciplinary Research. 2023. doi:10.5281/zenodo.10109961

work page doi:10.5281/zenodo.10109961 2023

[39] [40]

Nikolov and Eetu Malmi and Curtis G

Nikola I. Nikolov and Eetu Malmi and Curtis G. Northcutt and Lorenzo Parisi. Rapformer: Conditional Rap Lyrics Generation with Denoising Autoencoders. Proceedings of the 13th International Conference on Natural Language Generation. 2020. doi:10.18653/v1/2020.inlg-1.42

work page doi:10.18653/v1/2020.inlg-1.42 2020