YOMI-Bench: A Benchmark for Evaluating Kanji Reading and Phonological Understanding of LLMs for Japanese
Pith reviewed 2026-07-02 13:01 UTC · model grok-4.3
The pith
Even Japanese-specific LLMs perform poorly on kanji reading tasks according to a new benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper proposes YOMI-Bench consisting of four tasks to evaluate kanji reading performance in Japanese. Through systematic testing, it establishes that even Japanese-specific open LLMs exhibit low performance, while commercial LLMs also perform poorly on generation tasks that require consideration of kanji readings.
What carries the argument
YOMI-Bench, a benchmark of four tasks designed to measure LLMs' ability to handle multiple possible readings for individual kanji characters.
If this is right
- Even Japanese-specific open LLMs exhibit low performance on the benchmark tasks.
- Commercial LLMs perform poorly on generation tasks that require consideration of kanji readings.
- Multilingual open LLMs also show low performance due to the linguistic characteristic of multiple kanji readings.
- The benchmark reveals that inferring correct readings from surface-level text alone remains difficult for current models.
Where Pith is reading between the lines
- Training data that explicitly covers phonological variations across kanji readings could raise scores on these tasks.
- The same task design could be adapted to test reading disambiguation in other languages with character-based scripts.
- Low benchmark results suggest that Japanese text generation systems may produce incorrect readings until this gap is closed.
- Model developers could add explicit disambiguation layers focused on Japanese phonology to improve reliability.
Load-bearing premise
The four tasks in YOMI-Bench are valid and unbiased measures of kanji reading ability that reflect real-world difficulties rather than artifacts of task design.
What would settle it
A model achieving consistently high accuracy, such as above 80 percent, across all four tasks while retaining general language capabilities would falsify the reported low performance.
Figures
read the original abstract
We propose YOMI-Bench, a benchmark for evaluating kanji reading and phonological understanding of large language models (LLMs) for Japanese. In Japanese, a single kanji character often has multiple possible readings, making it difficult to infer the correct reading from surface-level text alone. Due to these linguistic characteristics, it is empirically known that LLMs exhibit low performance in kanji reading for Japanese. The proposed YOMI-Bench consists of four tasks specifically designed to evaluate kanji reading performance in Japanese. In our evaluation using YOMI-Bench, we assessed one multilingual open LLM, four Japanese-specific open LLMs, and five commercial LLMs. As a result, we found that even Japanese-specific models show low performance, and that commercial models also perform poorly on generation tasks that require consideration of kanji readings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes YOMI-Bench, a benchmark consisting of four tasks to evaluate the kanji reading and phonological understanding of LLMs for Japanese. It evaluates one multilingual open LLM, four Japanese-specific open LLMs, and five commercial LLMs, concluding that even Japanese-specific models exhibit low performance and that commercial models perform poorly on generation tasks requiring consideration of kanji readings.
Significance. If the benchmark tasks are shown to be valid and unbiased measures of kanji reading ability, this work could highlight important limitations in current LLMs for handling Japanese language specifics, which is relevant for improving multilingual and language-specific models. The evaluation across different model types provides a broad view of the issue.
major comments (1)
- [Abstract] Abstract: The abstract provides no details on task construction, data sources, metrics, or statistical significance, making it impossible to verify whether the reported low performance supports the claims about model capabilities.
minor comments (1)
- The paper could benefit from including example instances from the four tasks to illustrate the evaluation.
Simulated Author's Rebuttal
We thank the referee for their detailed review and constructive comments on our manuscript. We address the major comment point-by-point below.
read point-by-point responses
-
Referee: [Abstract] Abstract: The abstract provides no details on task construction, data sources, metrics, or statistical significance, making it impossible to verify whether the reported low performance supports the claims about model capabilities.
Authors: We agree that the current abstract is high-level and omits key details on task construction, data sources, and metrics, which are instead described in Sections 3 (Benchmark Construction) and 4 (Experiments). This is a valid observation. To address it, we will revise the abstract to concisely summarize the four tasks, their data sources (e.g., existing Japanese corpora and manually curated examples), the primary metrics (accuracy and F1 for reading prediction/generation), and note that evaluations are deterministic with no statistical significance testing applied due to the fixed benchmark design. This change will allow readers to better assess the claims directly from the abstract. revision: yes
Circularity Check
No significant circularity
full rationale
The paper proposes YOMI-Bench as a new benchmark with four tasks for evaluating kanji reading in LLMs and reports direct evaluation results on open and commercial models. No derivations, equations, fitted parameters, predictions, or uniqueness theorems are present. The claims rest on empirical performance numbers from the benchmark application itself, with no self-referential reductions or load-bearing self-citations that collapse the argument. This is a standard benchmark paper whose central content is self-contained against external model evaluations.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Kayako Matsuo and Shen-Hsing Annabel Chen and Chih-Wei Hue and Chiao-Yi Wu and Epifanio Bagarinao and Wen-Yih Isaac Tseng and Toshiharu Nakai. Neural substrates of phonological selection for Japanese character Kanji based on fMRI investigations. NeuroImage. 2010. doi:10.1016/j.neuroimage.2009.12.099
-
[2]
Historical Analysis of Japanese Writing Systems Hiragana, Katakana, and Kanji
Yessy Harun and Febi Nur Biduri. Historical Analysis of Japanese Writing Systems Hiragana, Katakana, and Kanji. International Journal of Social Service and Research. 2024
2024
-
[4]
arXiv preprint arXiv:2412.14471
Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs , author=. arXiv preprint arXiv:2412.14471. 2025. doi:10.48550/arXiv.2402.01349
-
[5]
What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations
Katharina Trinley and Toshiki Nakai and Tatiana Anikina and Tanja Baeumel. What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations. arXiv preprint arXiv:2507.20279. 2025
-
[6]
Benchmax: A comprehensive multilingual evaluation suite for large language models
Xu Huang and Wenhao Zhu and Hanxu Hu and Conghui He and Lei Li and Shujian Huang and Fei Yuan. BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models. arXiv preprint arXiv:2502.07346. 2025
-
[7]
The NazoNazo Benchmark: A Cost-Effective and Extensible Test of Insight-Based Reasoning in LLMs
Masaharu Mizumoto and Dat Nguyen and Zhiheng Han and Jiyuan Fang and Heyuan Guan and Xingfu Li and Naoya Shiraishi and Xuyang Tian and Yo Nakawake and Le Minh Nguyen. The NazoNazo Benchmark: A Cost-Effective and Extensible Test of Insight-Based Reasoning in LLMs. arXiv preprint arXiv:2509.14704. 2025
-
[8]
Multilingual Large Language Models: A Systematic Survey
Shaolin Zhu and Supryadi and Shaoyang Xu and Haoran Sun and Leiyu Pan and Menglong Cui and Jiangcun Du and Renren Jin and António Branco and Deyi Xiong. Multilingual Large Language Models: A Systematic Survey. arXiv preprint arXiv:2411.11072. 2024. doi:10.48550/arXiv.2411.11072
-
[9]
A survey on large language model benchmarks, 2025
Shiwen Ni and Guhong Chen and Shuaimin Li and Xuanang Chen and Siyi Li and Bingli Wang and Qiyao Wang and Xingjian Wang and Yifan Zhang and Liyang Fan and Chengming Li and Ruifeng Xu and Le Sun and Min Yang. A Survey on Large Language Model Benchmarks. arXiv preprint arXiv:2508.15361. 2025
-
[10]
Japanese Rhyme Generation Based on Mora Similarity and Generation Probability
Mibayashi, Ryota and Yamamoto, Takehiro and Ohshima, Hiroaki. Japanese Rhyme Generation Based on Mora Similarity and Generation Probability. Proceedings of the 27th International Conference on Information Integration and Web Intelligence. 2025. doi:10.1007/978-3-032-11976-6_7
-
[11]
and Kann, Katharina and Mielke, Sabrina J
Cotterell, Ryan and Kirov, Christo and Sylak-Glassman, John and Walther, G \'e raldine and Vylomova, Ekaterina and McCarthy, Arya D. and Kann, Katharina and Mielke, Sabrina J. and Nicolai, Garrett and Silfverberg, Miikka and Yarowsky, David and Eisner, Jason and Hulden, Mans. The C o NLL -- SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection...
-
[12]
Barrault, Lo \"i c and Biesialska, Magdalena and Bojar, Ond r ej and Costa-juss \`a , Marta R. and Federmann, Christian and Graham, Yvette and Grundkiewicz, Roman and Haddow, Barry and Huck, Matthias and Joanis, Eric and Kocmi, Tom and Koehn, Philipp and Lo, Chi-kiu and Ljube s i \'c , Nikola and Monz, Christof and Morishita, Makoto and Nagata, Masaaki an...
2020
-
[13]
Training Verifiers to Solve Math Word Problems
Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168. 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[14]
Saiful and Mubasshir, Kazi and Li, Yuan-Fang and Kang, Yong-Bin and Rahman, M
Hasan, Tahmid and Bhattacharjee, Abhik and Islam, Md. Saiful and Mubasshir, Kazi and Li, Yuan-Fang and Kang, Yong-Bin and Rahman, M. Sohel and Shahriyar, Rifat. XL -Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages. Proceedings of Findings of the Association for Computational Linguistics. 2021
2021
-
[15]
Development of a question answering system focused on an encyclopedia
Sekine, Satoshi. Development of a question answering system focused on an encyclopedia. Proceedings of the 9th Annual Meeting of the Association for Natural Language Processing. 2003
2003
-
[16]
JEMH op QA : Dataset for J apanese Explainable Multi-Hop Question Answering
Ishii, Ai and Inoue, Naoya and Suzuki, Hisami and Sekine, Satoshi. JEMH op QA : Dataset for J apanese Explainable Multi-Hop Question Answering. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation. 2024
2024
-
[17]
R oman L ens: The Role Of Latent R omanization In Multilinguality In LLM s
Saji, Alan and Husain, Jaavid Aktar and Jayakumar, Thanmay and Dabre, Raj and Kunchukuttan, Anoop and Puduppully, Ratish. R oman L ens: The Role Of Latent R omanization In Multilinguality In LLM s. Findings of the Association for Computational Linguistics. 2025
2025
-
[18]
Large language models are not robust multiple choice selectors, 2024
Zheng, Chujie and Zhou, Hao and Meng, Fandong and Zhou, Jie and Huang, Minlie. Large Language Models Are Not Robust Multiple Choice Selectors. International Conference on Representation Learning. 2024. doi:10.48550/arXiv.2309.03882
-
[19]
Measuring Massive Multitask Language Understanding
Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt. Measuring Massive Multitask Language Understanding. International Conference on Learning Representations. 2021
2021
-
[20]
and Zhang, Hao and Gonzalez, Joseph E
Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. Proceedings of the 37th International Conference on Neural Information Proce...
2023
-
[21]
Large Language Models Lack Understanding of Character Composition of Words
Andrew Shin and Kunitake Kaneko. Large Language Models Lack Understanding of Character Composition of Words. ICML Workshop on Large Language Models and Cognition. 2024
2024
-
[22]
David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman. GPQA : A Graduate-Level Google-Proof Q&A Benchmark. Proceedings of the First Conference on Language Modeling. 2024
2024
-
[23]
Large language models are zero-shot reasoners
Kojima, Takeshi and Gu, Shixiang Shane and Reid, Machel and Matsuo, Yutaka and Iwasawa, Yusuke. Large language models are zero-shot reasoners. Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022
2022
-
[24]
Exploring the Potential of Prompt-Based Method for Kanji-Kana Conversion in Japanese Braille Translation
Micah Kitsunai, Deborah Watty, Shu-Kai Hsieh. Exploring the Potential of Prompt-Based Method for Kanji-Kana Conversion in Japanese Braille Translation. In the 29th Annual Meeting of Japanese Association for Natural Language Processing. 2024
2024
-
[25]
JGLUE : J apanese General Language Understanding Evaluation
Kurihara, Kentaro and Kawahara, Daisuke and Shibata, Tomohide. JGLUE : J apanese General Language Understanding Evaluation. Proceedings of the Thirteenth Language Resources and Evaluation Conference. 2022
2022
-
[26]
Yin, Ziqi and Wang, Hao and Horio, Kaito and Kawahara, Daisuke and Sekine, Satoshi. Should We Respect LLM s? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance. Proceedings of the Second Workshop on Social Influence in Conversations (SICon 2024). 2024. doi:10.18653/v1/2024.sicon-1.2
-
[27]
ByT5: Towards a token-free future with pre-trained byte-to-byte models , journal =
Linting Xue and Aditya Barua and Noah Constant and Rami Al. ByT5: Towards a token-free future with pre-trained byte-to-byte models , journal =. 2021 , url =. 2105.13626 , timestamp =
-
[28]
Scalable training of
Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
-
[29]
Dan Gusfield , title =. 1997
1997
-
[30]
Tetreault , title =
Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =
2015
-
[31]
A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =
Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
-
[32]
Evaluating Large Language Models Trained on Code
Evaluating Large Language Models Trained on Code , author=. arXiv preprint arXiv:2107.03374 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[33]
DeepRapper : N eural Rap Generation with Rhyme and Rhythm Modeling
Liqiang Xue and Kaitao Song and Di Wu and Xu Tan and Ningyu Zhang and Tao Qin and Wentao Zhang and Tie-Yan Liu. DeepRapper : N eural Rap Generation with Rhyme and Rhythm Modeling. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. 2021. doi:10....
-
[34]
and Ba, J
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representation. 2015
2015
-
[35]
and Addanki, K
Wu, D. and Addanki, K. Learning to Rap Battle with Bilingual Recursive Neural Networks. Proceedings of the 24th International Joint Conference on Artificial Intelligence. 2015
2015
-
[36]
Ghost W riter: U sing an LSTM for Automatic Rap Lyric Generation
Peter Potash and Alexey Romanov and Anna Rumshisky. Ghost W riter: U sing an LSTM for Automatic Rap Lyric Generation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015. doi:https://doi.org/10.18653/v1/D15-1221
-
[37]
PLATO-Ad: A Unified Advertisement Text Generation Framework with Multi-Task Prompt Learning
Zeyang Lei and Chao Zhang and Xinchao Xu and Wenquan Wu and Zheng-yu Niu and Hua Wu and Haifeng Wang and Yi Yang and Shuanglong Li. PLATO-Ad: A Unified Advertisement Text Generation Framework with Multi-Task Prompt Learning. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2022. doi:10.18653/v1/2022.e...
-
[38]
DopeLearning: A Computational Approach to Rap Lyrics Generation
Eetu Malmi and Pyry Takala and Hannu Toivonen and Tapani Raiko and Aristides Gionis. DopeLearning: A Computational Approach to Rap Lyrics Generation. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. doi:10.1145/2939672.2939841
-
[39]
Verse Generation by Reverse Generation Considering Rhyme and Answer in Japanese Rap Battles
Mibayashi, Ryota and Yamamoto, Takehiro and Tsukuda, Kosetsu and Watanabe, Kento and Nakano, Tomoyasu and Goto, Masataka and Ohshima, Hiroaki. Verse Generation by Reverse Generation Considering Rhyme and Answer in Japanese Rap Battles. Proceedings of the 16th International Symposium on Computer Music Multidisciplinary Research. 2023. doi:10.5281/zenodo.10109961
-
[40]
Nikolov and Eetu Malmi and Curtis G
Nikola I. Nikolov and Eetu Malmi and Curtis G. Northcutt and Lorenzo Parisi. Rapformer: Conditional Rap Lyrics Generation with Denoising Autoencoders. Proceedings of the 13th International Conference on Natural Language Generation. 2020. doi:10.18653/v1/2020.inlg-1.42
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.