pith. machine review for the scientific record.

arxiv: 2605.09955 · v1 · submitted 2026-05-11 · 💻 cs.CL

Recognition: no theorem link

Beyond Majority Voting: Agreement-Based Clustering to Model Annotator Perspectives in Subjective NLP Tasks

Abinew Ali Ayele, Alexander Gelbukh, Eusebio Ricárdez-Vázquez, Ibrahim Said Ahmad, Idris Abdulmumin, Olga Kolesnikova, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad, Tadesse Destaw Belay

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:19 UTC · model grok-4.3

classification 💻 cs.CL
keywords: annotator disagreement · clustering · subjective NLP · label aggregation · sentiment analysis · emotion classification · hate speech detection
0 comments

The pith

Agreement-based clustering of annotators captures diverse perspectives and boosts performance on subjective NLP tasks beyond majority voting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that grouping annotators by how they agree on labels, rather than defaulting to a majority vote or building a separate model for each person, lets systems use the full range of viewpoints present in subjective data. This clustering approach was tested across 40 datasets spanning 18 languages on sentiment analysis, emotion classification, and hate speech detection. Experiments compared four ways of turning the clusters into predictions and found consistent gains over both simple majority aggregation and full per-annotator modeling. Multi-label and multitask setups worked particularly well once annotators were grouped. The work matters because disagreement in these tasks often reflects real differences in judgment that standard pipelines throw away.

Core claim

We propose an agreement-based clustering technique to model the disagreement between the annotators. We conduct comprehensive experiments in 40 datasets in 18 typologically diverse languages, covering three subjective NLP tasks: sentiment analysis, emotion classification, and hate speech detection. We evaluate four aggregation approaches: majority vote, ensemble, multi-label, and multitask. The results demonstrate that agreement-based clustering can leverage the full spectrum of annotator perspectives and significantly enhance classification performance in subjective NLP tasks compared to majority voting and individual annotator modeling. Regarding the aggregation approach, the multi-label and multitask approaches are better for modeling clustered annotators than an ensemble and model majority vote.

What carries the argument

Agreement-based clustering that groups annotators according to shared label patterns so each cluster represents a coherent perspective for downstream training.
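One way to make this concrete: compute each pair's agreement over the items both annotators labeled, then cluster on the resulting distances. Below is a minimal Python sketch under assumed choices (raw percent agreement, average-linkage agglomerative clustering); the paper's exact metric and algorithm are not given in the material on this page.

    # Illustrative sketch of agreement-based annotator clustering; not the
    # authors' implementation.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def pairwise_agreement(labels: np.ndarray) -> np.ndarray:
        """labels[i, j] = label annotator i gave item j, or -1 if unlabeled."""
        n = labels.shape[0]
        agree = np.eye(n)
        for i in range(n):
            for j in range(i + 1, n):
                both = (labels[i] != -1) & (labels[j] != -1)
                if both.any():
                    agree[i, j] = agree[j, i] = (
                        labels[i, both] == labels[j, both]
                    ).mean()
        return agree

    def cluster_annotators(labels: np.ndarray, n_clusters: int) -> np.ndarray:
        """Group annotators whose labeling patterns agree."""
        distance = 1.0 - pairwise_agreement(labels)
        model = AgglomerativeClustering(
            n_clusters=n_clusters, metric="precomputed", linkage="average"
        )
        return model.fit_predict(distance)

Each resulting cluster then stands in for one perspective; its members' labels can be aggregated by any of the four heads compared in the paper.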

If this is right

  • Clustered annotator models outperform both majority voting and individual annotator models on the three tasks examined.
  • Multi-label and multitask learning yield higher gains than ensemble methods when the input is already grouped by agreement clusters.
  • The approach scales to 18 languages without requiring separate models for every annotator.
  • Disagreement information is preserved through the clusters instead of being collapsed into a single label.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Dataset creators could design future annotation campaigns around fewer but more representative cluster-level labels rather than exhaustive individual labeling.
  • The same clustering step might be applied to other subjective tasks such as sarcasm detection or stance classification to test whether the performance pattern holds.
  • If the discovered clusters align with demographic or cultural factors, the method could support audits of model fairness across annotator groups.

Load-bearing premise

That patterns of agreement between annotators form stable clusters that reflect genuine perspective differences rather than noise or dataset artifacts.

What would settle it

Re-running the full pipeline on a new collection of subjective annotations where label disagreements are known to be random and checking whether the clustered models still outperform majority voting and per-annotator baselines.
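A hypothetical version of that control is cheap to synthesize: every annotator shares one latent label and disagrees only through random flips, so any recovered clusters are noise by construction. The sizes and flip rate below are illustrative; cluster_annotators is the sketch given earlier.

    # Illustrative control: disagreement that is pure noise by construction.
    import numpy as np

    rng = np.random.default_rng(0)
    n_annotators, n_items, n_classes, flip_rate = 20, 500, 3, 0.3
    latent = rng.integers(0, n_classes, size=n_items)      # one shared label
    labels = np.tile(latent, (n_annotators, 1))
    flips = rng.random(labels.shape) < flip_rate           # random relabels
    labels[flips] = rng.integers(0, n_classes, size=int(flips.sum()))
    # If the load-bearing premise holds, clusters on this matrix should be
    # unstable, and clustered models should not beat majority voting.
    assignments = cluster_annotators(labels, n_clusters=3)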

Figures

Figures reproduced from arXiv: 2605.09955 by Abinew Ali Ayele, Alexander Gelbukh, Eusebio Ricárdez-Vázquez, Ibrahim Said Ahmad, Idris Abdulmumin, Olga Kolesnikova, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad, Tadesse Destaw Belay.

Figure 1: Pairwise Agreement between annotators. This agreement score is used to group annotators.
Figure 2: Overview of multi-annotator modeling architectures. Our new contribution is in the clustering.
read the original abstract

Disagreement in annotation is a common phenomenon in the development of NLP datasets and serves as a valuable source of insight. While majority voting remains the dominant strategy for aggregating labels, recent work has explored modeling individual annotators to preserve their perspectives. However, modeling each annotator is resource-intensive and remains underexplored across various NLP tasks. We propose an agreement-based clustering technique to model the disagreement between the annotators. We conduct comprehensive experiments in 40 datasets in 18 typologically diverse languages, covering three subjective NLP tasks: sentiment analysis, emotion classification, and hate speech detection. We evaluate four aggregation approaches: majority vote, ensemble, multi-label, and multitask. The results demonstrate that agreement-based clustering can leverage the full spectrum of annotator perspectives and significantly enhance classification performance in subjective NLP tasks compared to majority voting and individual annotator modeling. Regarding the aggregation approach, the multi-label and multitask approaches are better for modeling clustered annotators than an ensemble and model majority vote.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper proposes an agreement-based clustering technique to group annotators by their label agreement patterns, thereby modeling diverse perspectives in subjective NLP tasks without modeling each annotator individually. It evaluates this approach on 40 datasets across 18 typologically diverse languages for sentiment analysis, emotion classification, and hate speech detection. Four aggregation strategies are compared—majority vote, ensemble, multi-label, and multitask—with results claiming that clustering yields significant performance gains over majority voting and per-annotator baselines, and that multi-label/multitask heads are preferable for clustered annotators.

Significance. If the central empirical claims hold after validation, the work would provide a scalable alternative to both majority voting (which discards perspective information) and full per-annotator modeling (which is resource-heavy). The large-scale, multilingual evaluation across three tasks strengthens potential applicability. Credit is due for the breadth of the experimental comparison and for testing multiple aggregation heads; however, the significance is tempered by the absence of direct evidence that the clusters capture stable, semantically meaningful perspectives rather than dataset artifacts.

major comments (4)
  1. §4 (Experimental Setup): No cluster-stability analysis is reported (e.g., adjusted Rand index or normalized mutual information across bootstrap resamples, different random seeds, or data splits). This is load-bearing for the claim that agreement-based clustering 'leverages the full spectrum of annotator perspectives,' because performance improvements could arise from fitting to noise or task-specific artifacts rather than genuine perspective groups.
  2. §5 (Results): The abstract and results claim to 'significantly enhance classification performance' relative to majority voting and individual modeling, yet no statistical significance tests (paired t-tests, McNemar’s test, or bootstrap confidence intervals) are provided for the reported gains. Without these, it is impossible to determine whether the improvements are robust or attributable to variance. (A minimal sketch of such a test appears below, after the minor comments.)
  3. §3 (Method) and §5.2 (Ablations): There is no ablation that isolates the contribution of the clustering step from the choice of multi-label or multitask aggregation heads. The performance advantage could therefore be driven primarily by the multi-head architectures rather than by the agreement-based grouping itself.
  4. §3 (Clustering details): The description of agreement-based clustering does not specify the distance metric used to compute annotator similarity or the procedure for selecting the number of clusters (e.g., elbow method, silhouette optimization, or fixed k). This omission affects reproducibility and raises questions about whether the method is as parameter-light as implied.
minor comments (2)
  1. [Abstract] The abstract names the four aggregation approaches but does not briefly characterize them; a one-clause gloss of each would aid immediate readability.
  2. [Figures in §5] Figure captions and axis labels in the results section could more clearly indicate which baseline each bar corresponds to, especially when multiple languages/tasks are plotted together.
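On major comment 2, the requested validation is mechanical once per-dataset scores are paired. A minimal sketch, assuming two aligned arrays of F1 scores over the 40 datasets; the function and its inputs are illustrative, not taken from the paper.

    # Illustrative paired test over per-dataset scores.
    import numpy as np
    from scipy import stats

    def paired_comparison(clustered_f1, majority_f1, n_boot=10_000, seed=0):
        clustered_f1 = np.asarray(clustered_f1)
        majority_f1 = np.asarray(majority_f1)
        diff = clustered_f1 - majority_f1
        t_stat, p_value = stats.ttest_rel(clustered_f1, majority_f1)
        # Bootstrap 95% CI on the mean per-dataset difference.
        rng = np.random.default_rng(seed)
        boots = rng.choice(diff, size=(n_boot, diff.size), replace=True).mean(axis=1)
        ci_low, ci_high = np.percentile(boots, [2.5, 97.5])
        return t_stat, p_value, (ci_low, ci_high)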

Simulated Authors' Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects for strengthening the empirical claims and reproducibility. We address each major comment point by point below and commit to revisions that directly incorporate the suggestions.

read point-by-point responses
  1. Referee: §4 (Experimental Setup): No cluster-stability analysis is reported (e.g., adjusted Rand index or normalized mutual information across bootstrap resamples, different random seeds, or data splits). This is load-bearing for the claim that agreement-based clustering 'leverages the full spectrum of annotator perspectives,' because performance improvements could arise from fitting to noise or task-specific artifacts rather than genuine perspective groups.

    Authors: We agree that cluster-stability analysis is necessary to support the claim that the clusters capture meaningful annotator perspectives. In the revised manuscript, we will add a new analysis in §4 reporting adjusted Rand index (ARI) and normalized mutual information (NMI) computed across bootstrap resamples, multiple random seeds, and different data splits. This will demonstrate that the agreement-based clusters are stable and not artifacts of noise or specific dataset splits. revision: yes

  2. Referee: §5 (Results): The abstract and results claim to 'significantly enhance classification performance' relative to majority voting and individual modeling, yet no statistical significance tests (paired t-tests, McNemar’s test, or bootstrap confidence intervals) are provided for the reported gains. Without these, it is impossible to determine whether the improvements are robust or attributable to variance.

    Authors: We appreciate this point and agree that statistical validation is required. We will update §5 (and the abstract where appropriate) to include paired t-tests across the 40 datasets and bootstrap confidence intervals for the performance differences. These tests will confirm whether the gains from agreement-based clustering over majority voting and per-annotator baselines are statistically significant. revision: yes

  3. Referee: §3 (Method) and §5.2 (Ablations): There is no ablation that isolates the contribution of the clustering step from the choice of multi-label or multitask aggregation heads. The performance advantage could therefore be driven primarily by the multi-head architectures rather than by the agreement-based grouping itself.

    Authors: We acknowledge the absence of a direct ablation isolating clustering from the aggregation heads. In the revised §5.2, we will add experiments comparing multi-label and multitask heads on clustered annotators versus the same heads applied without clustering (e.g., on the full set of annotators treated as one group). This will separate the effect of agreement-based grouping from the choice of multi-head architecture. revision: yes

  4. Referee: §3 (Clustering details): The description of agreement-based clustering does not specify the distance metric used to compute annotator similarity or the procedure for selecting the number of clusters (e.g., elbow method, silhouette optimization, or fixed k). This omission affects reproducibility and raises questions about whether the method is as parameter-light as implied.

    Authors: We will revise §3 to explicitly specify the distance metric (cosine similarity on annotator agreement vectors) and the cluster selection procedure (silhouette score optimization with a cap of k=10 to preserve efficiency). These details will be added to the method description to ensure full reproducibility. revision: yes
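Taken together, responses 1 and 4 amount to a checkable recipe. A hedged sketch of both pieces, reusing pairwise_agreement and cluster_annotators from the earlier sketch; the silhouette-capped selection of k and the ARI/NMI bootstrap follow the rebuttal's stated choices, which are themselves simulated rather than confirmed by a revised manuscript.

    # Illustrative cluster-count selection and stability analysis.
    import numpy as np
    from sklearn.metrics import (adjusted_rand_score,
                                 normalized_mutual_info_score,
                                 silhouette_score)

    def select_k(labels: np.ndarray, k_max: int = 10) -> int:
        """Silhouette-optimized k on the agreement distance, capped at k_max."""
        distance = 1.0 - pairwise_agreement(labels)
        best_k, best_score = 2, -1.0
        for k in range(2, k_max + 1):
            assignments = cluster_annotators(labels, k)
            score = silhouette_score(distance, assignments, metric="precomputed")
            if score > best_score:
                best_k, best_score = k, score
        return best_k

    def bootstrap_stability(labels: np.ndarray, k: int, n_resamples: int = 100):
        """Mean ARI/NMI between full-data clusters and item-resampled clusters."""
        rng = np.random.default_rng(0)
        reference = cluster_annotators(labels, k)
        ari, nmi = [], []
        for _ in range(n_resamples):
            items = rng.choice(labels.shape[1], size=labels.shape[1], replace=True)
            resampled = cluster_annotators(labels[:, items], k)
            ari.append(adjusted_rand_score(reference, resampled))
            nmi.append(normalized_mutual_info_score(reference, resampled))
        return float(np.mean(ari)), float(np.mean(nmi))

High mean ARI/NMI across resamples would support the load-bearing premise; values near zero would suggest the clusters are artifacts.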

Circularity Check

0 steps flagged

No circularity: purely empirical method comparison

full rationale

The paper proposes an agreement-based clustering method and evaluates it empirically across 40 datasets in 18 languages for three subjective tasks, comparing four aggregation strategies (majority vote, ensemble, multi-label, multitask) against baselines. No equations, derivations, or first-principles claims appear that reduce performance gains to quantities defined by the clustering step itself. Results are reported as observed experimental outcomes rather than forced by construction or self-citation chains. The central claim rests on dataset-specific performance numbers, which are externally falsifiable and not tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied empirical paper; it introduces no new mathematical axioms, free parameters beyond standard clustering hyperparameters, or invented entities.

pith-pipeline@v0.9.0 · 5522 in / 1052 out tokens · 57582 ms · 2026-05-12T04:19:32.094355+00:00 · methodology
