pith. sign in

arxiv: 2606.07300 · v1 · pith:WSCLI5CCnew · submitted 2026-06-05 · 💻 cs.CL

Phun-Bench: Evaluating LLMs on Phonological Understanding in Chinese

Pith reviewed 2026-06-27 22:17 UTC · model grok-4.3

classification 💻 cs.CL
keywords phonological understandinglarge language modelsChinese languagebenchmarkhomophonyrhymephonetic similarity
0
0 comments X

The pith

LLMs recall Chinese pronunciations accurately but struggle to apply phonological knowledge flexibly like humans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Phun-Bench, a benchmark with tasks across homophony, rhyme, and phonetic similarity to measure phonological understanding in Chinese. It finds that models succeed at recalling correct pronunciations yet fail to use this knowledge in the flexible, intuitive manner human speakers employ. This distinction matters because language connects sounds directly to meaning and generation, so gaps in phonological handling limit model performance on sound-based reasoning. The benchmark uses diverse settings to reduce the chance that success comes from memorization alone. Authors further propose a hypothesis on the mechanism of how models handle phonological perception.

Core claim

While LLMs excel at recalling correct pronunciations, they generally struggle to leverage phonological knowledge in the flexible and intuitive way that human speakers do.

What carries the argument

Phun-Bench, a benchmark with diverse tasks and settings across the dimensions of Homophony, Rhyme, and Phonetic Similarity.

If this is right

  • LLM training approaches may need to incorporate explicit mechanisms for handling sound patterns beyond recall.
  • Evaluation protocols for language models should include tests that separate pronunciation memory from applied phonological reasoning.
  • Improvements in phonological handling could enhance model performance on generation tasks that rely on rhyme, homophony, or sound similarity.
  • The proposed hypothesis on LLM phonological perception offers a direction for targeted analysis of internal representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Comparable benchmarks in other languages could test whether the observed gap is specific to Chinese or reflects a broader architectural limit.
  • Models that process audio input directly might bypass some of the limitations seen in text-only phonological tasks.
  • The gap suggests potential value in hybrid systems that combine symbolic phonological rules with neural learning.

Load-bearing premise

The benchmark tasks isolate genuine phonological understanding and cannot be solved through rote memorization of pronunciations.

What would settle it

An LLM achieving human-level scores on all Phun-Bench tasks while demonstrating non-memorized, flexible application of phonological rules would falsify the central claim of struggle.

Figures

Figures reproduced from arXiv: 2606.07300 by Weiming Lu, Xing Yue, Yongliang Shen.

Figure 1
Figure 1. Figure 1: (a) Our question pattern design compared to those in prior work. (b) Each pinyin is decomposed into [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Data samples from Phun-Bench. This results in 1,684 (s, wpun, wtarget) tuples. 4.3 Dimension 2: Rhyme-Based Understanding We present the Rhyming Sentence Generation (RSG) task to assess LLMs’ rhyme understanding. Task Formulation Generating rhyming sen￾tences requires an understanding of the language’s phonological system and the relationships between rhyming words. Specifically, we condition LLMs on a ref… view at source ↗
Figure 3
Figure 3. Figure 3: Results of the two homophony-based understanding tasks—HR and CHR. In the right panel, light bars [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of the hypothesized mechanism of [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Results of the SC task in word form. L, Q and [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Average Similarity Comparison (SC) results [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The proportions of error types in the 2-rhyme setting. ES denotes the example sentence; GS denotes the [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Results of the SC task in pinyin form. L, Q [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Overall performance scaling of Qwen3 models across different sizes (0.6B to 235B-A22B) and modes (Non-Thinking vs. Thinking) [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Comparison between performances under the pinyin-form and word-form settings. L, Q and DS denote [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Similarity Comparison results for audio models across perturbation settings. [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: A correct reasoning trace for the Homophone Recall task from [PITH_FULL_IMAGE:figures/full_fig_p023_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: An incorrect reasoning trace for the Homophone Recall task from [PITH_FULL_IMAGE:figures/full_fig_p024_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: A correct reasoning trace for the Similarity Comparison task ( [PITH_FULL_IMAGE:figures/full_fig_p025_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: An incorrect reasoning trace for the Similarity Comparison task ( [PITH_FULL_IMAGE:figures/full_fig_p026_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Prompt template for the character-to-pinyin verification experiment (§ [PITH_FULL_IMAGE:figures/full_fig_p027_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Prompt template for the Homophone Recall task (§ [PITH_FULL_IMAGE:figures/full_fig_p028_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Prompt template for the Contextual Homophone Recognition task (§ [PITH_FULL_IMAGE:figures/full_fig_p029_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Prompt template for the pinyin2char test in RQ 1 (§ [PITH_FULL_IMAGE:figures/full_fig_p030_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Prompt template for the char2pinyin test in RQ 1 (§ [PITH_FULL_IMAGE:figures/full_fig_p031_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Prompt template for the word2pinyin test in RQ 1 (§ [PITH_FULL_IMAGE:figures/full_fig_p032_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Prompt template for the pinyin2word test in RQ 1 (§ [PITH_FULL_IMAGE:figures/full_fig_p033_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Prompt template for the word2pinyin test in RQ 1 (§ [PITH_FULL_IMAGE:figures/full_fig_p034_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: Prompt template for the pinyin2word test in RQ 1 (§ [PITH_FULL_IMAGE:figures/full_fig_p035_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Prompt template for the Rhyming Sentence Generation task (1-rhyme setting, § [PITH_FULL_IMAGE:figures/full_fig_p037_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: Prompt template for the Rhyming Sentence Generation task (2-rhyme setting, § [PITH_FULL_IMAGE:figures/full_fig_p039_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: Prompt template for the Similarity Comparison task (in word form, § [PITH_FULL_IMAGE:figures/full_fig_p041_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: Prompt template for the Similarity Comparison task (in pinyin form, § [PITH_FULL_IMAGE:figures/full_fig_p043_28.png] view at source ↗
Figure 29
Figure 29. Figure 29: Speech template used for text-to-speech in the audio-model evaluation (Homophone Recall). [PITH_FULL_IMAGE:figures/full_fig_p044_29.png] view at source ↗
Figure 30
Figure 30. Figure 30: Prompt template for in the audio-model experiment (§ [PITH_FULL_IMAGE:figures/full_fig_p044_30.png] view at source ↗
Figure 31
Figure 31. Figure 31: Instruction for the character-to-pinyin verification experiment (§ [PITH_FULL_IMAGE:figures/full_fig_p044_31.png] view at source ↗
Figure 32
Figure 32. Figure 32: Instruction for the Homophone Recall task (§ [PITH_FULL_IMAGE:figures/full_fig_p045_32.png] view at source ↗
Figure 33
Figure 33. Figure 33: Instruction for the Contextual Homophone Recognition task (§ [PITH_FULL_IMAGE:figures/full_fig_p045_33.png] view at source ↗
Figure 34
Figure 34. Figure 34: Instruction for the Similarity Comparison task (in word form, § [PITH_FULL_IMAGE:figures/full_fig_p046_34.png] view at source ↗
Figure 35
Figure 35. Figure 35: Instruction for the Similarity Comparison task (in pinyin form, § [PITH_FULL_IMAGE:figures/full_fig_p046_35.png] view at source ↗
Figure 36
Figure 36. Figure 36: Screenshot of the Similarity Comparison .xlsx file provided to human participants [PITH_FULL_IMAGE:figures/full_fig_p046_36.png] view at source ↗
read the original abstract

Language is a vehicle for thought, intricately tied to sounds, symbols, and meaning. However, most large language model (LLM) research focuses on meaning (semantics) and symbols (spelling) while largely overlooking sounds. Existing benchmarks on LLMs' phonological abilities are either solvable through rote memorization or intertwined with other abilities, making them inadequate to measure LLMs' genuine ability in phonological understanding. Here, we present Phun-Bench, a purpose-built Chinese benchmark with diverse tasks and settings across three dimensions (Homophony, Rhyme, and Phonetic Similarity), designed to systematically evaluate LLMs' phonological understanding. Our results show that while LLMs excel at recalling correct pronunciations, they generally struggle to leverage phonological knowledge in the flexible and intuitive way that human speakers do. Moreover, through detailed analyses, we propose a hypothesis regarding the underlying mechanism of LLMs' phonological understanding and "perception", highlighting an underexplored frontier for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Phun-Bench, a Chinese-language benchmark with tasks across three dimensions (Homophony, Rhyme, Phonetic Similarity) intended to evaluate LLMs' phonological understanding beyond rote memorization. It reports that models recall correct pronunciations effectively but struggle to apply phonological knowledge flexibly in the manner of human speakers, and offers a hypothesis on the underlying mechanism of LLM phonological 'perception' based on detailed analyses.

Significance. If the benchmark tasks genuinely isolate flexible phonological reasoning from semantic, orthographic, or lexical cues, the work would usefully document a limitation in current LLMs and supply an evaluation resource for an underexplored capability. The proposed mechanism hypothesis could stimulate targeted follow-up experiments.

major comments (2)
  1. [Benchmark construction] Benchmark construction section: the three task dimensions are presented as isolating phonological manipulation, yet the manuscript does not describe explicit controls (e.g., distractors matched on semantics or orthography but differing in phonology, or human baselines restricted to phonological cues). Without such controls, low model performance could arise from failure to exploit non-phonological regularities in the Chinese training distribution rather than from inability to apply phonological knowledge flexibly, directly undermining the central claim.
  2. [Results and analysis] Results and analysis sections: the reported superiority on pronunciation recall versus the three Phun-Bench dimensions is load-bearing for the 'flexible application' conclusion, but no ablation or error analysis is described that rules out lexical co-occurrence statistics as an alternative explanation for model failures.
minor comments (2)
  1. [Abstract] The abstract states that existing benchmarks are 'solvable through rote memorization,' but the manuscript does not cite or compare against specific prior phonological benchmarks with quantitative evidence.
  2. [Benchmark description] Notation for the three dimensions is introduced without a summary table listing task formats, number of items, and example stimuli, which would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which help clarify the scope and limitations of our benchmark. We address each major point below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction section: the three task dimensions are presented as isolating phonological manipulation, yet the manuscript does not describe explicit controls (e.g., distractors matched on semantics or orthography but differing in phonology, or human baselines restricted to phonological cues). Without such controls, low model performance could arise from failure to exploit non-phonological regularities in the Chinese training distribution rather than from inability to apply phonological knowledge flexibly, directly undermining the central claim.

    Authors: The referee is correct that the current manuscript does not provide an explicit description of controls for semantic or orthographic cues. While the task items were constructed to minimize such confounds through the use of minimal pairs and frequency-matched items, this design rationale is not sufficiently documented. We will revise the benchmark construction section to include a dedicated subsection on controls (e.g., semantic/orthographic distractors and human baselines collected under phonological-only presentation), along with any quantitative verification that non-phonological cues alone do not solve the tasks. revision: yes

  2. Referee: [Results and analysis] Results and analysis sections: the reported superiority on pronunciation recall versus the three Phun-Bench dimensions is load-bearing for the 'flexible application' conclusion, but no ablation or error analysis is described that rules out lexical co-occurrence statistics as an alternative explanation for model failures.

    Authors: We agree that the absence of targeted ablations leaves open the possibility that model failures reflect lexical co-occurrence patterns rather than a lack of flexible phonological reasoning. In the revision we will add an error-analysis subsection that (1) correlates model errors with lexical n-gram statistics from the training corpus and (2) reports performance on a control set of items matched for co-occurrence but differing in required phonological manipulation. This will allow readers to assess whether the performance gap persists after accounting for statistical regularities. revision: yes

Circularity Check

0 steps flagged

No derivation chain or fitted quantities; benchmark evaluation is self-contained

full rationale

The paper presents an empirical benchmark (Phun-Bench) for LLM phonological tasks across three dimensions, with results reported directly from model evaluations. No equations, derivations, parameter fittings, or predictions appear in the provided text or abstract. Claims rest on observed performance differences rather than any chain that reduces to self-definition, self-citation, or renamed inputs. The evaluation is therefore self-contained with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no technical content on parameters, axioms, or new entities.

pith-pipeline@v0.9.1-grok · 5698 in / 847 out tokens · 18787 ms · 2026-06-27T22:17:10.994745+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

12 extracted references · 4 linked inside Pith

  1. [1]

    Shiyang Cheng, Pengcheng Zhu, Jueting Liu, and Zehua Wang

    ACM. Shiyang Cheng, Pengcheng Zhu, Jueting Liu, and Zehua Wang. 2024. A Survey of Grapheme-to- Phoneme Conversion Methods.Applied Sciences, 14(24):11790. Geoffrey Cideron, Sertan Girgin, Mauro Verzetti, Damien Vincent, Matej Kastelic, Zalán Borsos, Brian McWilliams, Victor Ungureanu, Olivier Bachem, Olivier Pietquin, Matthieu Geist, Léonard Hussenot, Neil...

  2. [2]

    Avia Efrat, Or Honovich, and Omer Levy

    Deepseek-V3 Technical Report.arXiv.org, abs/2412.19437. Avia Efrat, Or Honovich, and Omer Levy. 2023. LMen- try: A language model benchmark of elementary language tasks. InFindings of the Association for Computational Linguistics: ACL 2023, pages 10476– 10501, Toronto, Canada. Association for Computa- tional Linguistics. Anjie Fang, Simone Filice, Nut Lim...

  3. [3]

    Jackson L

    Academy & Industry Research Collaboration Center. Jackson L. Lee, Lucas F. E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman. 2020. Massively Mul- tilingual Pronunciation Modeling with WikiPron. In International Conference on Language Resources and Evaluation (LREC). Shunwei Lei, Yaoxun Xu, Zhiwei Li...

  4. [4]

    Junyu Lu, Bo Xu, Xiaokun Zhang, Changrong Min, Liang Yang, and Hongfei Lin

    Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes.arXiv. Junyu Lu, Bo Xu, Xiaokun Zhang, Changrong Min, Liang Yang, and Hongfei Lin. 2023. Facilitating fine-grained detection of Chinese toxic language: Hi- erarchical taxonomy, resources, and benchmarks. In Proceedings of the 61st Annual Me...

  5. [5]

    Ziqian Ning, Huakang Chen, Yuepeng Jiang, Chunbo Hao, Guobin Ma, Shuai Wang, Jixun Yao, and Lei Xie

    ACM. Ziqian Ning, Huakang Chen, Yuepeng Jiang, Chunbo Hao, Guobin Ma, Shuai Wang, Jixun Yao, and Lei Xie. 2025. Diffrhythm: Blazingly Fast and Embar- rassingly Simple End-to-End Full-Length Song Gen- eration with Latent Diffusion.arXiv. open-dict data. ipa-dict - monolingual wordlists with pronunciation information in ipa [online]. OpenAI. 2025. Openai o3...

  6. [6]

    In Findings of the Association for Computational Lin- guistics: ACL 2023, pages 12863–12880, Toronto, Canada

    SongRewriter: A Chinese song rewriting sys- tem with controllable content and rhyme scheme. In Findings of the Association for Computational Lin- guistics: ACL 2023, pages 12863–12880, Toronto, Canada. Association for Computational Linguistics. Ashima Suvarna, Harshita Khandelwal, and Nanyun Peng. 2024. PhonologyBench: Evaluating phono- logical skills of ...

  7. [7]

    a good pun is its own reword

    “a good pun is its own reword”: Can large language models understand puns? InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11766–11782, Miami, Florida, USA. Association for Computational Linguistics. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, C...

  8. [8]

    一直都这 样孤独地来 去

    were developed to generate verses end-to- end, conditioned on stylistic attributes like rhyme and meter. With the advancement of LLMs, poems generated by GPT-3.5 are even rated more favor- ably than human-written ones in certain aspects such as rhythm (Porter and Machery, 2024). Addi- tionally, Chen et al. (2024b) examine diversity in automatic poetry gen...

  9. [9]

    “案”拼音是an,可能的汉字有案、安、暗等。

  10. [10]

    “怖”拼音是bu,可能的汉字有怖、布、步等。

  11. [11]

    “臼”拼音是jiu,可能的汉字有臼、旧、就等。

  12. [12]

    案”可能对应“安

    “颁”拼音是ban,可能的汉字有颁、班、搬等。 接下来,我需要将这些可能的汉字组合成四字成语,并且每个字的拼音要与原短语中的 拼音一致。例如,第一个字是an,第二个是bu,第三个是jiu,第四个是ban。那么可能 的组合是:安布旧班?但这样组合起来似乎没有意义,可能需要调整每个字的可能选 项。 或者,可能每个拼音对应的字不是直接对应,而是每个字的拼音与原短语中的拼音 相同,但字形不同。例如,“案”可能对应“安”,“怖”对应“布”,“臼”对应“旧”,“颁”对 应“班”,这样组合成“安布旧班”,但这样的成语不存在。 可能我需要重新考虑每个拼音对应的字。例如,“案”可能对应“按”,但拼音是an,所 以可能没问题。或者“怖”可能对应“捕”(bu),但拼音是bu,所以可能。然后“臼”可能 对应“就”(jiu),而“...