arxiv: 2605.17205 · v1 · pith:KW7AZ7UAnew · submitted 2026-05-17 · 💻 cs.CL

LLMs for automatic annotation of Mandarin narrative transcripts

Pith reviewed 2026-05-20 14:50 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM annotation · Mandarin narratives · narrative macrostructure · MAIN · automatic annotation · discourse analysis · Cohen's kappa · spoken language

The pith

Large language models can annotate narrative macrostructure in Mandarin speech transcripts nearly as reliably as humans while cutting time by 65%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests large language models on the task of labeling hierarchical story grammar elements in transcribed spoken Mandarin narratives. It evaluates four models against trained human annotators using the MAIN instrument across stories told by children, young adults, and older adults. The strongest model reached a Cohen's kappa of 0.794 with humans, close to the 0.872 level of agreement between humans, and completed the work 65 percent faster. Performance varied by element type and by speaker age, with young adult narratives proving more difficult due to greater lexical and semantic complexity.

Core claim

The central claim is that LLMs can reliably automate discourse-level annotation of narrative macrostructure in non-English spoken corpora. The best model achieved substantial agreement with human raters at k=.794, approaching the human-human reliability of k=.872, while reducing annotation time by 65 percent. Annotation difficulty proved systematic by macrostructure category, and model performance declined on young adult narratives that contained greater lexical variation, semantic ambiguity, and multi-element utterances.
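
For readers who want to see how agreement figures of this kind are computed, here is a minimal Python sketch of Cohen's kappa together with the Landis and Koch (1977) interpretation bands. The label sequences are made-up toy data with placeholder category codes, not data from the paper, and the function names `cohen_kappa` and `landis_koch` are illustrative; the paper's own scripts may differ.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal proportions.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch (1977) verbal band."""
    if kappa < 0:
        return "poor"
    bands = ((0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial"))
    for upper, name in bands:
        if kappa <= upper:
            return name
    return "almost perfect"


# Hypothetical toy annotations (placeholder codes, not from the paper).
human = ["T", "I1", "G1", "A", "O", "I1", "G1", "A", "O", "O"]
model = ["T", "I1", "G1", "A", "O", "I1", "A", "A", "O", "G1"]

toy_kappa = cohen_kappa(human, model)
print(round(toy_kappa, 3), landis_koch(toy_kappa))
print(landis_koch(0.794))  # substantial
print(landis_koch(0.872))  # almost perfect
```

Under these bands the reported model-human value of .794 sits just below the .80 boundary, while the human-human value of .872 sits above it.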

What carries the argument

The Multilingual Assessment Instrument for Narratives (MAIN), which identifies and scores the presence and organization of story grammar elements such as setting, characters, and episodes in spoken narratives.

If this is right

  • LLMs can reduce the labor cost of building large annotated corpora for language acquisition and sociolinguistic research in Mandarin.
  • Human review remains essential for categories that require subtle semantic differentiation between macrostructure elements.
  • Lightweight locally deployable models are currently unreliable for this task and should not be used without additional safeguards.
  • Open-sourced prompt templates make it possible for other teams to apply or adapt the same approach to comparable discourse annotation problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompting strategy could be tested on narrative data in other languages to assess cross-linguistic portability of the observed reliability levels.
  • Integrating LLM annotation into existing pipelines would allow researchers to scale up the size of discourse datasets without proportional increases in manual effort.
  • Fine-tuning or few-shot adaptation on age-specific Mandarin examples might reduce the performance gap observed on young adult narratives.

Load-bearing premise

That the human annotators constitute a stable and unbiased gold standard and that the chosen child, young adult, and older adult narratives are representative enough for the performance patterns to generalize.

What would settle it

An independent team of human raters annotates the identical set of transcripts and the resulting model-human agreement falls substantially below the reported k=.794 level, or the same models, tested on a fresh collection of Mandarin narratives drawn from different populations or contexts, fail to reach comparable agreement.

Original abstract

Linguistic annotation of transcribed speech is essential for research in language acquisition, language disorders, and sociolinguistics, yet remains labor-intensive and time-consuming. While Large Language Models (LLMs) have shown promise in automating annotation tasks, their ability to handle complex discourse-level annotation in non-English languages remains understudied. This study evaluates whether LLMs can reliably annotate narrative macrostructure (the hierarchical organization of story grammar elements) in spoken Mandarin, using the Multilingual Assessment Instrument for Narratives (MAIN) as a testbed. We compared four LLMs against trained human annotators on narratives produced by children, young adults, and older adults. The best-performing model achieved agreement with human raters (k=.794) approaching human-human reliability levels (k=.872) while reducing annotation time by 65%, whereas the locally deployable lightweight model performed substantially worse. Annotation difficulty varied systematically by macrostructure element type, with categories requiring subtle semantic differentiation posing persistent challenges. Furthermore, model reliability decreased on young adult narratives, which exhibited greater lexical variation, semantic ambiguity, and multi-element integration within single utterances. These findings suggest that LLMs can effectively support discourse-level annotation in non-English spoken corpora, while highlighting the continued need for human oversight in semantically complex tasks. Our prompt templates are open sourced for future use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates LLMs for annotating narrative macrostructure in spoken Mandarin transcripts using the MAIN framework. It compares four models to human annotators across narratives from children, young adults, and older adults, reporting that the best model reaches Cohen's kappa of 0.794 with humans (approaching human-human kappa of 0.872) and cuts annotation time by 65%. Performance varies by age group and element type, with challenges for semantically ambiguous cases, and the authors release their prompt templates.

Significance. If the results hold under fuller methodological disclosure, the work would show that LLMs can meaningfully reduce effort in discourse-level annotation for non-English spoken data, with direct relevance to language acquisition, disorders, and sociolinguistics research. The open-sourcing of prompts is a clear strength for reproducibility.

major comments (2)
  1. [Abstract] Abstract: the reported agreement (k=.794) and time reduction (65%) are presented without any description of prompt design, exact model versions, the formula or procedure used for inter-annotator kappa, or statistical tests for the time savings; these omissions prevent full verification of the central performance claim.
  2. [Evaluation] The evaluation relies on the assumption that the chosen child/young-adult/older-adult narratives are representative of broader Mandarin discourse, yet the manuscript supplies no sampling frame, speaker demographics, or topic diversity; the abstract itself notes reduced reliability on young-adult narratives due to lexical variation and semantic ambiguity, so this gap directly affects generalizability of the reported k values and time savings.
minor comments (2)
  1. Consider adding a table that breaks down kappa by macrostructure element type and by age group to make the systematic variation claims easier to inspect.
  2. The abstract states that prompts are open-sourced; ensure the repository link and exact templates appear in the main text or supplementary material with clear usage instructions.
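
The breakdown by age group and by element type suggested in the first minor comment could be produced along these lines. This is a sketch under stated assumptions: the record layout `(age_group, human_label, model_label)`, the helper names, and the toy records are all hypothetical, and scoring each element one-vs-rest is one possible choice that the paper does not confirm.

```python
from collections import Counter, defaultdict


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)


# Hypothetical records: (age_group, human_label, model_label).
records = [
    ("children", "I1", "I1"), ("children", "G1", "G1"),
    ("children", "A", "A"), ("children", "O", "O"),
    ("young", "I1", "I1"), ("young", "G1", "A"),
    ("young", "A", "A"), ("young", "O", "G1"),
]


def kappa_by_group(rows):
    """Kappa computed separately within each age group."""
    pairs = defaultdict(lambda: ([], []))
    for group, human, model in rows:
        pairs[group][0].append(human)
        pairs[group][1].append(model)
    return {g: cohen_kappa(h, m) for g, (h, m) in pairs.items()}


def kappa_by_element(rows):
    """One-vs-rest kappa: each label is scored as present/absent per item."""
    labels = sorted({h for _, h, _ in rows} | {m for _, _, m in rows})
    result = {}
    for label in labels:
        human = [h == label for _, h, _ in rows]
        model = [m == label for _, _, m in rows]
        result[label] = cohen_kappa(human, model)
    return result


by_group = kappa_by_group(records)
by_element = kappa_by_element(records)
for name, value in {**by_group, **by_element}.items():
    print(f"{name:10s} {value:.3f}")
```

The two dictionaries map directly onto the rows of the table the referee asks for, one row per age group and one per element type.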

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important issues of methodological transparency and the scope of our findings. We address each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the reported agreement (k=.794) and time reduction (65%) are presented without any description of prompt design, exact model versions, the formula or procedure used for inter-annotator kappa, or statistical tests for the time savings; these omissions prevent full verification of the central performance claim.

    Authors: We agree that the abstract would be strengthened by including concise methodological context for the key quantitative claims. In the revised manuscript we will expand the abstract to note the iterative prompt design process based on MAIN guidelines, specify the exact model versions evaluated, clarify that inter-annotator agreement was computed with Cohen's kappa, and state that the reported time savings were assessed via paired statistical comparisons of annotation durations. Full procedural details remain in the Methods section.

    revision: yes

  2. Referee: [Evaluation] The evaluation relies on the assumption that the chosen child/young-adult/older-adult narratives are representative of broader Mandarin discourse, yet the manuscript supplies no sampling frame, speaker demographics, or topic diversity; the abstract itself notes reduced reliability on young-adult narratives due to lexical variation and semantic ambiguity, so this gap directly affects generalizability of the reported k values and time savings.

    Authors: We acknowledge the limitation in the current description of the corpus. The narratives were drawn from a convenience sample collected in a single metropolitan region in China. In the revision we will add a dedicated paragraph in the Methods section that specifies the sampling frame, participant demographics (age ranges, gender balance, and education levels), and the fixed set of MAIN picture-based topics used for elicitation. We will also expand the Discussion to explicitly address generalizability, noting that the lower reliability observed on young-adult narratives reflects greater lexical variation and semantic ambiguity in that subgroup and that future work should test the approach on more diverse Mandarin discourse samples.

    revision: yes
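
The first response above mentions paired statistical comparisons of annotation durations without naming a test. As one hedged illustration, the sketch below computes a percent time reduction and an exact sign-flip permutation test on paired durations. The durations are invented minutes per transcript, and the choice of test and the function names are assumptions, not the authors' procedure.

```python
from itertools import product


def percent_reduction(manual, assisted):
    """Percent of total annotation time saved by the assisted workflow."""
    return 100.0 * (1.0 - sum(assisted) / sum(manual))


def paired_sign_flip_p(manual, assisted):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis of no difference, each paired difference is
    equally likely to be positive or negative, so every sign pattern is
    enumerated and compared with the observed total difference.
    """
    diffs = [m - a for m, a in zip(manual, assisted)]
    observed = abs(sum(diffs))
    total = hits = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            hits += 1
    return hits / total


# Invented durations in minutes per transcript (not from the paper).
manual = [20.0, 18.0, 25.0, 22.0, 15.0]
assisted = [8.0, 7.0, 9.0, 9.0, 7.0]

reduction = percent_reduction(manual, assisted)
p_value = paired_sign_flip_p(manual, assisted)
print(f"reduction: {reduction:.1f}%  p = {p_value}")
```

Full enumeration is only practical for a small number of pairs, since the number of sign patterns doubles with each pair; a larger corpus would call for random sampling of sign patterns or a standard paired test.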

Circularity Check

0 steps flagged

Empirical comparison to external human annotations is self-contained with no circular reduction

full rationale

The paper reports an empirical evaluation of LLMs annotating Mandarin narrative macrostructure via the MAIN framework, measuring agreement (Cohen's k) directly against trained human annotators on held-out transcripts from children, young adults, and older adults. The central performance claims (best-model k=.794 approaching human-human k=.872, plus 65% time reduction) are computed from these external comparisons rather than from any fitted parameters, self-referential definitions, or equations that would make the outputs equivalent to the inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to justify the results; the derivation chain consists of standard inter-annotator agreement metrics applied to independent test data. This is the most common honest finding for an applied NLP evaluation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on empirical comparison rather than new theoretical constructs. The main background assumptions concern the reliability of human annotation as ground truth and the representativeness of the chosen narratives.

axioms (2)
  • [domain assumption] Human annotators provide a stable and unbiased reference standard for narrative macrostructure elements.
    The paper uses human-human kappa as the benchmark against which model performance is judged.
  • [domain assumption] The MAIN framework elements capture the relevant hierarchical organization in Mandarin spoken narratives.
    The study adopts MAIN as the testbed without additional validation of its suitability for Mandarin.

pith-pipeline@v0.9.0 · 5767 in / 1349 out tokens · 48815 ms · 2026-05-20T14:50:05.197820+00:00 · methodology
