arxiv: 2606.25383 · v1 · pith:KBU6AZ7Xnew · submitted 2026-06-24 · 💻 cs.CL

Introducing corpora Hlava Cor and Hlava AD: Human Label Variation in Coreference and Discourse Relations

Pith reviewed 2026-06-25 21:12 UTC · model grok-4.3

classification 💻 cs.CL
keywords coreference annotation · discourse relations · human label variation · inter-annotator agreement · Czech corpora · text coherence · annotator disagreement

The pith

Two Czech corpora with multiple annotations document substantial human variation in identifying coreference and discourse relations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors created two corpora of Czech texts to explore how different people interpret the same text when identifying coreference links and discourse relations. Hlava Cor contains 1,024 contexts annotated by three people each, covering pronouns, noun phrases, and adverbials across text types. Hlava AD has 512 contexts annotated by five people focusing on attributive and non-attributive discourse constructions. Both show inter-annotator agreement around 60-65%, with comments revealing varying confidence and reading strategies. This provides resources to study why text coherence understanding differs among individuals and how that affects computational models.

Core claim

The paper establishes two new resources, Hlava Cor and Hlava AD, that capture human label variation through parallel annotations of coreference and discourse phenomena in Czech, accompanied by annotator explanations. Inter-annotator agreement is approximately 60-65% in both corpora, and for coreference it tends to be lower where automatic resolution models disagree.

What carries the argument

The central objects are the Hlava Cor corpus for coreference variation and the Hlava AD corpus for discourse relation variation, each with multiple independent annotations and free-text explanations of annotator choices.

If this is right

  • Agreement drops in cases where automatic coreference resolution models disagree, indicating those instances are harder for humans too.
  • Annotator comments show differences in interpretation levels and individual strategies for understanding text.
  • The corpora enable study of variation across grammatical-semantic categories and text types.
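
The first point above, agreement compared across items where models agree and items where they disagree, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data layout (a `models_agree` flag plus a list of annotator labels per item), the function names, and the toy labels are hypothetical and do not reflect the corpora's actual format or the authors' procedure.

```python
from itertools import combinations
from statistics import mean


def item_agreement(labels):
    """Share of annotator pairs that chose the same label for one item."""
    pairs = list(combinations(labels, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


def agreement_by_group(items):
    """Mean per-item agreement, split by whether the models agreed.

    items: list of (models_agree, labels) tuples (hypothetical layout).
    """
    groups = {True: [], False: []}
    for models_agree, labels in items:
        groups[models_agree].append(item_agreement(labels))
    return {key: mean(vals) for key, vals in groups.items() if vals}


# Toy data, invented for illustration only.
toy_items = [
    (True, ["x", "x", "x"]),
    (True, ["x", "x", "y"]),
    (False, ["x", "y", "z"]),
    (False, ["x", "x", "y"]),
]
result = agreement_by_group(toy_items)
print(result)
```

On real data, a lower value for the `False` group than for the `True` group would correspond to the pattern the paper reports for coreference.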

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These resources could support development of models that output distributions over possible human interpretations instead of single labels.
  • Extending the method to other languages or phenomena like named entity recognition might reveal universal patterns in human disagreement.
  • The lower agreement on model-disagreed cases suggests a way to identify ambiguous examples for targeted data collection.
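
As a rough illustration of the first extension above, parallel annotations for one item can be turned into a distribution over labels rather than a single gold label. This is only a sketch under stated assumptions: the label names are invented, and the entropy score is one common way to quantify ambiguity, not something the paper proposes.

```python
import math
from collections import Counter


def label_distribution(labels):
    """Relative frequency of each label among the annotators of one item."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def entropy(distribution):
    """Shannon entropy in bits; 0 means all annotators chose the same label."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


# Five hypothetical annotations of one discourse relation (invented labels).
dist = label_distribution(
    ["contrast", "contrast", "concession", "contrast", "reason"]
)
print(dist)
print(entropy(dist))
```

A model trained against such distributions would be evaluated on how closely its predicted probabilities match the spread of human choices, and high-entropy items are natural candidates for the targeted data collection mentioned in the last point.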

Load-bearing premise

The selected contexts and the three or five annotators per item are sufficient to represent genuine differences in human text understanding rather than effects specific to the guidelines or sample.

What would settle it

A replication with a different group of annotators on the same contexts yielding substantially different disagreement patterns or agreement rates would indicate the variation is not representative.

Original abstract

As previous research on annotator disagreement in discourse phenomena has shown, understanding text coherence varies considerably from one individual to another. To explore this phenomenon, we created two corpora with multiple annotations of Czech texts, accompanied by annotators' explanations of their choices. The first corpus consists of 1,024 contexts annotated in parallel by three annotators. It captures differences in the identification of coreference across various text types and grammatical-semantic categories, including pronouns, full noun phrases, and anaphoric adverbials. The second corpus comprises 512 contexts, annotated in parallel by five annotators, and focuses on identifying discourse relations in attributive and non-attributive constructions. Both corpora achieve a comparable inter-annotator agreement of approximately 60-65%. For coreference annotation, agreement tends to be lower in cases where automatic coreference resolution models disagree, suggesting that when the models disagree, the examples tend to be more difficult or ambiguous for human annotators to interpret. The annotators' comments, both for coreference and discourse relations, further reveal differences in interpretation, varying levels of confidence in text understanding, and individual reading strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces two new multi-annotated Czech corpora for studying human label variation: Hlava Cor (1,024 contexts annotated in parallel by three annotators for coreference across text types and grammatical-semantic categories including pronouns, noun phrases, and anaphoric adverbials) and Hlava AD (512 contexts annotated in parallel by five annotators for discourse relations in attributive and non-attributive constructions). Both include annotators' explanations of choices. The central claims are that the corpora achieve comparable IAA of approximately 60-65%, that agreement is lower where automatic coreference models disagree, and that the annotations and comments reveal differences in interpretation, confidence, and reading strategies.

Significance. If the reported IAA and variation patterns hold under scrutiny, the corpora would provide a useful resource for Czech discourse annotation and for research on annotator disagreement, particularly by including explanations that enable qualitative analysis of individual strategies. This aligns with broader efforts in the field to document and model human variation rather than assuming single gold labels.

major comments (2)
  1. [Abstract] Abstract: the claim of 'comparable inter-annotator agreement of approximately 60-65%' provides no exact metric (e.g., pairwise agreement, Cohen's kappa, or F1), no details on how it was computed across the three/five annotators, and no statistical tests, which is load-bearing for the central descriptive claims about the corpora capturing genuine variation.
  2. [Abstract] Abstract: no independent validation (e.g., zero-guideline re-annotation or cross-protocol comparison) is described to establish that the 60-65% IAA and reported differences by text type, category, and model disagreement reflect natural human understanding variation rather than artifacts of the annotation protocol, training, or sampling of the 1,024/512 contexts.
minor comments (1)
  1. The abstract refers to 'various text types and grammatical-semantic categories' without enumerating them; adding a brief list or reference to the relevant table or section would improve clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for greater precision in the abstract. We address each major comment below.

  1. Referee: [Abstract] Abstract: the claim of 'comparable inter-annotator agreement of approximately 60-65%' provides no exact metric (e.g., pairwise agreement, Cohen's kappa, or F1), no details on how it was computed across the three/five annotators, and no statistical tests, which is load-bearing for the central descriptive claims about the corpora capturing genuine variation.

    Authors: We agree the abstract is insufficiently precise on this point. The full paper reports pairwise agreement (averaged across annotator pairs) as the primary metric, with separate breakdowns for the two corpora. We will revise the abstract to state the exact figures and the computation method, and to note that no statistical significance tests were applied to the overall IAA values. This change will be made in the next version. revision: yes

  2. Referee: [Abstract] Abstract: no independent validation (e.g., zero-guideline re-annotation or cross-protocol comparison) is described to establish that the 60-65% IAA and reported differences by text type, category, and model disagreement reflect natural human understanding variation rather than artifacts of the annotation protocol, training, or sampling of the 1,024/512 contexts.

    Authors: The manuscript does not include independent validation experiments such as zero-guideline re-annotation. The corpora follow standard annotation guidelines for Czech coreference and discourse relations, and the provided annotator explanations enable qualitative examination of interpretation differences. We will add an explicit limitations paragraph acknowledging that protocol or sampling effects cannot be fully excluded without further studies, while maintaining that the multi-annotator design with explanations still offers a valuable resource for studying variation. revision: partial
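
The metric named in the first response, pairwise agreement averaged across annotator pairs, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the input layout (one list of labels per item, annotators in a fixed order), and the toy labels are assumptions.

```python
from itertools import combinations


def average_pairwise_agreement(annotations):
    """Mean raw agreement over all annotator pairs.

    annotations: list of items; each item is a list with one label per
    annotator, in the same annotator order (hypothetical layout).
    """
    n_annotators = len(annotations[0])
    pair_scores = []
    for a, b in combinations(range(n_annotators), 2):
        matches = sum(1 for item in annotations if item[a] == item[b])
        pair_scores.append(matches / len(annotations))
    return sum(pair_scores) / len(pair_scores)


# Four toy items labelled by three annotators (invented labels).
toy = [
    ["A", "A", "A"],
    ["A", "B", "A"],
    ["B", "B", "C"],
    ["C", "C", "C"],
]
score = average_pairwise_agreement(toy)
print(round(score, 3))  # 0.667
```

Because the score is averaged over pairs, it can be compared between a three-annotator and a five-annotator corpus. It is raw agreement, though, and unlike Cohen's kappa it does not correct for agreement expected by chance.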

Circularity Check

0 steps flagged

No circularity: purely descriptive corpus creation

The paper introduces two new corpora with parallel annotations (1,024 contexts by 3 annotators for coreference; 512 by 5 for discourse relations) and reports observed IAA (~60-65%) plus breakdowns by text type, category, and model disagreement. No equations, derivations, fitted parameters, predictions, or uniqueness theorems appear. No self-citations are used to justify central claims. The work is self-contained descriptive annotation reporting with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a resource-creation paper; no free parameters, mathematical axioms, or invented entities are involved.

pith-pipeline@v0.9.1-grok · 5768 in / 1034 out tokens · 25974 ms · 2026-06-25T21:12:25.027592+00:00 · methodology


Reference graph

Works this paper leans on

13 extracted references · 1 linked inside Pith

  1. [1]

    Introduction. During our long-term development of corpora focused on discourse relations, coreference, and information structure, we observed that certain linguistic phenomena tend to produce lower inter-annotator agreement than others. This observation aligns with a growing body of research on annotation evaluation taking into account different p...

  2. [2]

    ecologically valid

    Related Work. The interpretation of linguistic meaning is never entirely fixed or uniform; it is shaped by context, perspective, and the inherent ambiguity of natural language. This fundamental indeterminacy influences all levels of linguistic analysis, including discourse and coreference annotation, where human judgments often diverge even under det...

  3. [3]

    The overall statistics for the annotated data are listed in Table 1

    Hlava Cor and Hlava AD: General Settings. To study human label variation, we extracted texts primarily from different parts of the PDT-C 2.0 (see detailed description in Sections 4.3 and 5) and created Hlava Cor and Hlava AD. The overall statistics for the annotated data are listed in Table 1. Hlava Cor contains 1,024 cases annotated by three annotator...

  4. [4]

    no antecedent

    Hlava Cor: Human Language Variation in Coreference. The annotation task for Hlava Cor was defined as the identification of coreference, i.e. reference to the same extra-linguistic entity, concept, or situation. 4.1. Hlava Cor: Annotation Process. Annotators were instructed to read the column Sentence with a highlighted expression in question and the left-hand...

  5. [5]

    4 disagreement possible

    using the coreference resolution system CorPipe (Straka, 2024), the winner of the CRAC 2024 shared task (Novák et al., 2024), and selected 5 achieving the best results on the two Czech CorefUD datasets. 4 disagreement possible. Based on the output of the five models, the potential examples were divided into two groups: (i) examples where all five models agreed o...

  6. [6]

    target of the relation

    Hlava AD: Human Label Variation in Attribution and Discourse. The corpus contains 512 short Czech text segments with explicit inter-sentential discourse relations, multiply annotated by five annotators. All texts in Hlava AD come from PDT-C 2.0. Similarly to Hlava Cor, we separately considered the written and spoken modalities. Based on our previous findin...

  7. [7]

    However, it is possible to compare the average pairwise agreement on target identification, which neutralizes the difference in the number of annotators

    Observations across Hlava Cor and Hlava AD. Given that Hlava Cor covers a dataset twice the size of Hlava AD, and that Hlava AD was annotated by five annotators while Hlava Cor involved only three parallel annotations, the results are not fully comparable. However, it is possible to compare the average pairwise agreement on target identification, which neutralizes t...

  8. [8]

    Conclusions. In this paper, we presented and described two datasets created to study human label variation: Hlava Cor (Nedoluzhko et al., 2026), focused on coreference, and Hlava AD (Šárka Zikánová et al., 2024), focused on attribution and discourse relations. Hlava Cor explores coreference with respect to the reference status of anaphoric expressions (s...

  9. [9]

    Basile, B

    Bibliographical References. V. Basile, B. Plank, D. Hovy, P. Van Der Lee, L. Van Der Plas, and M. Poesio. 2021. We need to consider disagreement in evaluation. In Proceedings of 1st Workshop on Benchmarking, ACL, pages 15–21. Peter Bourgonje and Manfred Stede. 2020. The Potsdam commentary corpus 2.2: Extending annotations for shallow discourse parsing. In P...

  10. [10]

    In Human Language Technology. Challenges for Computer Science and Linguistics - 6th Language and Technology Conference, LTC 2013, Poznan, Poland, December 7-9, 2013

    Polish coreference corpus. In Human Language Technology. Challenges for Computer Science and Linguistics - 6th Language and Technology Conference, LTC 2013, Poznan, Poland, December 7-9, 2013. Revised Selected Papers, volume 9561 of Lecture Notes in Computer Science, pages 215–226. Springer. Siyao Peng, Yang Janet Liu, and Amir Zeldes. 2022. GCDT: A Chinese RST ...

  11. [11]

    A crowdsourced corpus of multiple judgments and disagreement on anaphoric interpretation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1778–1789, Minneapolis, Minnesota. Association for Computational Lingui...

  12. [12]

    In Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation, pages 54–63

    Ontonotes: A large training corpus for enhanced processing. In Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation, pages 54–63. Springer-Verlag, New York. Frances Yung, Merel Scholman, Sarka Zikanova, and Vera Demberg. 2024. DiscoGeM 2.0: A parallel corpus of English, German, French and Czech im...

  13. [13]

    Language Resource References. Jan Hajič, Eduard Bejček, Alevtina Bémová, Eva Buráňová, Eva Fučíková, Eva Hajičová, Jiří Havelka, Jaroslava Hlaváčová, Petr Homola, Pavel Ircing, Jiří Kárník, Václava Kettnerová, Natalia Klyueva, Veronika Kolářová, Lucie Kučová, Markéta Lopatková, David Mareček, Marie Mikulová, Jiří Mírovský, Anna Nedoluzhko, Michal Nov...