LLMs for automatic annotation of Mandarin narrative transcripts
Pith reviewed 2026-05-20 14:50 UTC · model grok-4.3
The pith
Large language models can annotate narrative macrostructure in Mandarin speech transcripts nearly as reliably as humans while cutting time by 65%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that LLMs can reliably automate discourse-level annotation of narrative macrostructure in non-English spoken corpora. The best model achieved substantial agreement with human raters (κ = .794), approaching the human-human reliability of κ = .872, while reducing annotation time by 65 percent. Annotation difficulty varied systematically by macrostructure category, and model performance declined on young adult narratives, which contained greater lexical variation, semantic ambiguity, and multi-element utterances.
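The label "substantial" follows the conventional Landis and Koch (1977) descriptive bands for kappa. A minimal sketch of that mapping is given here; the function and constant names are illustrative and not taken from the paper.

```python
# Landis and Koch (1977) descriptive bands for Cohen's kappa.
# Each entry is (inclusive upper bound, label); values below 0 are "poor".
LANDIS_KOCH_BANDS = [
    (0.20, "slight"),
    (0.40, "fair"),
    (0.60, "moderate"),
    (0.80, "substantial"),
    (1.00, "almost perfect"),
]


def interpret_kappa(kappa: float) -> str:
    """Map a kappa value to its Landis and Koch descriptive label."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [-1, 1]")
    if kappa < 0.0:
        return "poor"
    for upper, label in LANDIS_KOCH_BANDS:
        if kappa <= upper:
            return label
    return "almost perfect"  # not reached for valid input


if __name__ == "__main__":
    # The two headline values reported in the paper.
    for name, value in [("best model vs. human", 0.794),
                        ("human vs. human", 0.872)]:
        print(f"{name}: kappa={value:.3f} -> {interpret_kappa(value)}")
```

Under these bands the best model falls just below the boundary that the human-human figure clears, which is why the claim is phrased as "approaching" rather than matching human reliability.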
What carries the argument
The argument rests on the Multilingual Assessment Instrument for Narratives (MAIN), which identifies and scores the presence and organization of story grammar elements such as setting, characters, and episodes in spoken narratives.
If this is right
- LLMs can reduce the labor cost of building large annotated corpora for language acquisition and sociolinguistic research in Mandarin.
- Human review remains essential for categories that require subtle semantic differentiation between macrostructure elements.
- Lightweight locally deployable models are currently unreliable for this task and should not be used without additional safeguards.
- Open-sourced prompt templates make it possible for other teams to apply or adapt the same approach to comparable discourse annotation problems.
Where Pith is reading between the lines
- The same prompting strategy could be tested on narrative data in other languages to assess cross-linguistic portability of the observed reliability levels.
- Integrating LLM annotation into existing pipelines would allow researchers to scale up the size of discourse datasets without proportional increases in manual effort.
- Fine-tuning or few-shot adaptation on age-specific Mandarin examples might reduce the performance gap observed on young adult narratives.
Load-bearing premise
That the human annotators constitute a stable and unbiased gold standard and that the chosen child, young adult, and older adult narratives are representative enough for the performance patterns to generalize.
What would settle it
Either of two outcomes would undermine the claim: an independent team of human raters annotates the identical set of transcripts and the resulting model-human agreement falls substantially below the reported κ = .794, or the same models, tested on a fresh collection of Mandarin narratives drawn from different populations or contexts, fail to reach comparable agreement.
Original abstract
Linguistic annotation of transcribed speech is essential for research in language acquisition, language disorders, and sociolinguistics, yet remains labor-intensive and time-consuming. While Large Language Models (LLMs) have shown promise in automating annotation tasks, their ability to handle complex discourse-level annotation in non-English languages remains understudied. This study evaluates whether LLMs can reliably annotate narrative macrostructure (the hierarchical organization of story grammar elements) in spoken Mandarin, using the Multilingual Assessment Instrument for Narratives (MAIN) as a testbed. We compared four LLMs against trained human annotators on narratives produced by children, young adults, and older adults. The best-performing model achieved agreement with human raters (κ=.794) approaching human-human reliability levels (κ=.872) while reducing annotation time by 65%, whereas the locally deployable lightweight model performed substantially worse. Annotation difficulty varied systematically by macrostructure element type, with categories requiring subtle semantic differentiation posing persistent challenges. Furthermore, model reliability decreased on young adult narratives, which exhibited greater lexical variation, semantic ambiguity, and multi-element integration within single utterances. These findings suggest that LLMs can effectively support discourse-level annotation in non-English spoken corpora, while highlighting the continued need for human oversight in semantically complex tasks. Our prompt templates are open sourced for future use.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates LLMs for annotating narrative macrostructure in spoken Mandarin transcripts using the MAIN framework. It compares four models to human annotators across narratives from children, young adults, and older adults, reporting that the best model reaches Cohen's kappa of 0.794 with humans (approaching human-human kappa of 0.872) and cuts annotation time by 65%. Performance varies by age group and element type, with challenges for semantically ambiguous cases, and the authors release their prompt templates.
Significance. If the results hold under fuller methodological disclosure, the work would show that LLMs can meaningfully reduce effort in discourse-level annotation for non-English spoken data, with direct relevance to language acquisition, disorders, and sociolinguistics research. The open-sourcing of prompts is a clear strength for reproducibility.
major comments (2)
- [Abstract] The reported agreement (κ=.794) and time reduction (65%) are presented without any description of prompt design, exact model versions, the formula or procedure used for inter-annotator kappa, or statistical tests for the time savings; these omissions prevent full verification of the central performance claim.
- [Evaluation] The evaluation relies on the assumption that the chosen child/young-adult/older-adult narratives are representative of broader Mandarin discourse, yet the manuscript supplies no sampling frame, speaker demographics, or topic diversity; the abstract itself notes reduced reliability on young-adult narratives due to lexical variation and semantic ambiguity, so this gap directly affects generalizability of the reported κ values and time savings.
minor comments (2)
- Consider adding a table that breaks down kappa by macrostructure element type and by age group to make the systematic variation claims easier to inspect.
- The abstract states that prompts are open-sourced; ensure the repository link and exact templates appear in the main text or supplementary material with clear usage instructions.
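The breakdown requested in the first minor comment could be produced along the following lines. This is a sketch under stated assumptions, not the authors' procedure: it assumes each utterance carries a set of element tags per annotator and computes a one-vs-rest Cohen's kappa per element within each age group. The record layout, the function names, and the toy tags ("goal", "attempt", "outcome") are hypothetical.

```python
from collections import defaultdict


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    categories = set(a) | set(b)
    # Agreement expected by chance from each rater's marginal frequencies.
    p_e = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if p_e == 1.0:
        # Kappa is undefined (0/0) when both raters use one identical
        # category throughout; treated as perfect agreement in this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


def kappa_breakdown(records, labels):
    """Return {(group, label): kappa} using one-vs-rest binarization.

    records: iterable of (group, human_tags, model_tags), tags being sets.
    """
    by_group = defaultdict(list)
    for group, human, model in records:
        by_group[group].append((human, model))
    table = {}
    for group, pairs in by_group.items():
        for label in labels:
            human_bin = [label in h for h, _ in pairs]
            model_bin = [label in m for _, m in pairs]
            table[(group, label)] = cohen_kappa(human_bin, model_bin)
    return table


if __name__ == "__main__":
    # Toy data for illustration only; not drawn from the paper.
    toy = [
        ("children", {"goal"}, {"goal"}),
        ("children", {"attempt"}, {"attempt"}),
        ("children", {"outcome"}, {"attempt"}),
        ("young adults", {"goal", "attempt"}, {"goal"}),
        ("young adults", {"outcome"}, {"outcome"}),
        ("young adults", {"attempt"}, {"outcome"}),
    ]
    result = kappa_breakdown(toy, ["goal", "attempt", "outcome"])
    for (group, label), value in sorted(result.items()):
        print(f"{group:13s} {label:8s} kappa={value:.3f}")
```

One-vs-rest binarization is chosen here because the abstract reports multi-element utterances, so a single utterance may carry several tags. In practice a vetted implementation such as scikit-learn's `cohen_kappa_score` could replace the hand-written helper.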
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important issues of methodological transparency and the scope of our findings. We address each major comment below and indicate the revisions we will make.
Point-by-point responses
- Referee: [Abstract] The reported agreement (κ=.794) and time reduction (65%) are presented without any description of prompt design, exact model versions, the formula or procedure used for inter-annotator kappa, or statistical tests for the time savings; these omissions prevent full verification of the central performance claim.
  Authors: We agree that the abstract would be strengthened by including concise methodological context for the key quantitative claims. In the revised manuscript we will expand the abstract to note the iterative prompt design process based on MAIN guidelines, specify the exact model versions evaluated, clarify that inter-annotator agreement was computed with Cohen's kappa, and state that the reported time savings were assessed via paired statistical comparisons of annotation durations. Full procedural details remain in the Methods section.
  Revision: yes
- Referee: [Evaluation] The evaluation relies on the assumption that the chosen child/young-adult/older-adult narratives are representative of broader Mandarin discourse, yet the manuscript supplies no sampling frame, speaker demographics, or topic diversity; the abstract itself notes reduced reliability on young-adult narratives due to lexical variation and semantic ambiguity, so this gap directly affects generalizability of the reported κ values and time savings.
  Authors: We acknowledge the limitation in the current description of the corpus. The narratives were drawn from a convenience sample collected in a single metropolitan region in China. In the revision we will add a dedicated paragraph in the Methods section that specifies the sampling frame, participant demographics (age ranges, gender balance, and education levels), and the fixed set of MAIN picture-based topics used for elicitation. We will also expand the Discussion to explicitly address generalizability, noting that the lower reliability observed on young-adult narratives reflects greater lexical variation and semantic ambiguity in that subgroup and that future work should test the approach on more diverse Mandarin discourse samples.
  Revision: yes
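The rebuttal mentions "paired statistical comparisons of annotation durations" without naming a test. As one possible reading, the sketch below computes the overall proportional time reduction and a paired t statistic from per-transcript durations. The choice of a paired t-test, the function name, and the example numbers are all assumptions made for illustration.

```python
import math
import statistics


def paired_time_summary(manual_minutes, assisted_minutes):
    """Summarize paired annotation durations measured on the same transcripts."""
    n = len(manual_minutes)
    if n != len(assisted_minutes) or n < 2:
        raise ValueError("need at least two paired durations")
    # Per-transcript saving: manual time minus LLM-assisted time.
    diffs = [m - a for m, a in zip(manual_minutes, assisted_minutes)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    # Share of total manual time that was saved overall.
    reduction = 1.0 - sum(assisted_minutes) / sum(manual_minutes)
    # Paired t statistic; undefined when every difference is identical.
    t_stat = mean_diff / (sd_diff / math.sqrt(n)) if sd_diff > 0 else float("nan")
    return {
        "mean_diff": mean_diff,
        "reduction": reduction,
        "t_stat": t_stat,
        "df": n - 1,
    }


if __name__ == "__main__":
    # Made-up durations in minutes, chosen only to illustrate the arithmetic.
    summary = paired_time_summary([10, 20, 30], [4, 6, 11])
    print(f"reduction: {summary['reduction']:.0%}")
    print(f"t({summary['df']}) = {summary['t_stat']:.2f}")
```

A p-value would additionally require the t distribution, for example via `scipy.stats.ttest_rel`. If the durations are strongly skewed, a nonparametric paired test may be more appropriate.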
Circularity Check
Empirical comparison to external human annotations is self-contained with no circular reduction
full rationale
The paper reports an empirical evaluation of LLMs annotating Mandarin narrative macrostructure via the MAIN framework, measuring agreement (Cohen's κ) directly against trained human annotators on held-out transcripts from children, young adults, and older adults. The central performance claims (best-model κ=.794 approaching human-human κ=.872, plus 65% time reduction) are computed from these external comparisons rather than from any fitted parameters, self-referential definitions, or equations that would make the outputs equivalent to the inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to justify the results; the derivation chain consists of standard inter-annotator agreement metrics applied to independent test data. This is the most common honest finding for an applied NLP evaluation paper.
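For reference, the standard definition of Cohen's kappa is written out here to show that the statistic depends only on the two label sequences and contains no fitted parameter. The notation (raters a and b, N items, category counts n_c) is introduced for this note and is not taken from the paper.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_o = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}[a_i = b_i],
\qquad
p_e = \sum_{c} \frac{n_c^{(a)}}{N}\cdot\frac{n_c^{(b)}}{N}
```

Here p_o is the observed proportion of items on which the two raters agree, and p_e is the agreement expected by chance given how often each rater uses each category c.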
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Human annotators provide a stable and unbiased reference standard for narrative macrostructure elements.
- domain assumption: The MAIN framework elements capture the relevant hierarchical organization in Mandarin spoken narratives.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: The best-performing model achieved agreement with human raters (κ=.794) approaching human-human reliability levels (κ=.872) while reducing annotation time by 65%.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.