pith. machine review for the scientific record.

arxiv: 2605.09955 · v1 · submitted 2026-05-11 · 💻 cs.CL

Recognition: no theorem link

Beyond Majority Voting: Agreement-Based Clustering to Model Annotator Perspectives in Subjective NLP Tasks

Abinew Ali Ayele, Alexander Gelbukh, Eusebio Ricárdez-Vázquez, Ibrahim Said Ahmad, Idris Abdulmumin, Olga Kolesnikova, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad, Tadesse Destaw Belay

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:19 UTC · model grok-4.3

classification 💻 cs.CL
keywords: annotator disagreement · clustering · subjective NLP · label aggregation · sentiment analysis · emotion classification · hate speech detection
0 comments

The pith

Agreement-based clustering of annotators captures diverse perspectives and boosts performance on subjective NLP tasks beyond majority voting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that grouping annotators by how they agree on labels, rather than defaulting to a majority vote or building a separate model for each person, lets systems use the full range of viewpoints present in subjective data. This clustering approach was tested across 40 datasets spanning 18 languages on sentiment analysis, emotion classification, and hate speech detection. Experiments compared four ways of turning the clusters into predictions and found consistent gains over both simple majority aggregation and full per-annotator modeling. Multi-label and multitask setups worked particularly well once annotators were grouped. The work matters because disagreement in these tasks often reflects real differences in judgment that standard pipelines throw away.

Core claim

We propose an agreement-based clustering technique to model the disagreement between the annotators. We conduct comprehensive experiments in 40 datasets in 18 typologically diverse languages, covering three subjective NLP tasks: sentiment analysis, emotion classification, and hate speech detection. We evaluate four aggregation approaches: majority vote, ensemble, multi-label, and multitask. The results demonstrate that agreement-based clustering can leverage the full spectrum of annotator perspectives and significantly enhance classification performance in subjective NLP tasks compared to majority voting and individual annotator modeling. Regarding the aggregation approach, the multi-label and multitask approaches are better for modeling clustered annotators than an ensemble and model majority vote.

What carries the argument

Agreement-based clustering that groups annotators according to shared label patterns so each cluster represents a coherent perspective for downstream training.
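One way to make this concrete: compute each pair's agreement over the items both annotators labeled, then cluster on the resulting distances. Below is a minimal Python sketch under assumed choices (raw percent agreement, average-linkage agglomerative clustering); the paper's exact metric and algorithm are not given in the material on this page.

    # Illustrative sketch of agreement-based annotator clustering; not the
    # authors' implementation.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def pairwise_agreement(labels: np.ndarray) -> np.ndarray:
        """labels[i, j] = label annotator i gave item j, or -1 if unlabeled."""
        n = labels.shape[0]
        agree = np.eye(n)
        for i in range(n):
            for j in range(i + 1, n):
                both = (labels[i] != -1) & (labels[j] != -1)
                if both.any():
                    agree[i, j] = agree[j, i] = (
                        labels[i, both] == labels[j, both]
                    ).mean()
        return agree

    def cluster_annotators(labels: np.ndarray, n_clusters: int) -> np.ndarray:
        """Group annotators whose labeling patterns agree."""
        distance = 1.0 - pairwise_agreement(labels)
        model = AgglomerativeClustering(
            n_clusters=n_clusters, metric="precomputed", linkage="average"
        )
        return model.fit_predict(distance)

Each resulting cluster then stands in for one perspective; its members' labels can be aggregated by any of the four heads compared in the paper.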

If this is right

  • Clustered annotator models outperform both majority voting and individual annotator models on the three tasks examined.
  • Multi-label and multitask learning yield higher gains than ensemble methods when the input is already grouped by agreement clusters.
  • The approach scales to 18 languages without requiring separate models for every annotator.
  • Disagreement information is preserved through the clusters instead of being collapsed into a single label.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Dataset creators could design future annotation campaigns around fewer but more representative cluster-level labels rather than exhaustive individual labeling.
  • The same clustering step might be applied to other subjective tasks such as sarcasm detection or stance classification to test whether the performance pattern holds.
  • If the discovered clusters align with demographic or cultural factors, the method could support audits of model fairness across annotator groups.

Load-bearing premise

That patterns of agreement between annotators form stable clusters that reflect genuine perspective differences rather than noise or dataset artifacts.

What would settle it

Re-running the full pipeline on a new collection of subjective annotations where label disagreements are known to be random and checking whether the clustered models still outperform majority voting and per-annotator baselines.
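A hypothetical version of that control is cheap to synthesize: every annotator shares one latent label and disagrees only through random flips, so any recovered clusters are noise by construction. The sizes and flip rate below are illustrative; cluster_annotators is the sketch given earlier.

    # Illustrative control: disagreement that is pure noise by construction.
    import numpy as np

    rng = np.random.default_rng(0)
    n_annotators, n_items, n_classes, flip_rate = 20, 500, 3, 0.3
    latent = rng.integers(0, n_classes, size=n_items)      # one shared label
    labels = np.tile(latent, (n_annotators, 1))
    flips = rng.random(labels.shape) < flip_rate           # random relabels
    labels[flips] = rng.integers(0, n_classes, size=int(flips.sum()))
    # If the load-bearing premise holds, clusters on this matrix should be
    # unstable, and clustered models should not beat majority voting.
    assignments = cluster_annotators(labels, n_clusters=3)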

Figures

Figures reproduced from arXiv: 2605.09955 by Abinew Ali Ayele, Alexander Gelbukh, Eusebio Ricárdez-Vázquez, Ibrahim Said Ahmad, Idris Abdulmumin, Olga Kolesnikova, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad, Tadesse Destaw Belay.

Figure 1: Pairwise Agreement between annotators. This agreement score is used to group annotators.
Figure 2: Overview of multi-annotator modeling architectures. Our new contribution is in the clustering.
read the original abstract

Disagreement in annotation is a common phenomenon in the development of NLP datasets and serves as a valuable source of insight. While majority voting remains the dominant strategy for aggregating labels, recent work has explored modeling individual annotators to preserve their perspectives. However, modeling each annotator is resource-intensive and remains underexplored across various NLP tasks. We propose an agreement-based clustering technique to model the disagreement between the annotators. We conduct comprehensive experiments in 40 datasets in 18 typologically diverse languages, covering three subjective NLP tasks: sentiment analysis, emotion classification, and hate speech detection. We evaluate four aggregation approaches: majority vote, ensemble, multi-label, and multitask. The results demonstrate that agreement-based clustering can leverage the full spectrum of annotator perspectives and significantly enhance classification performance in subjective NLP tasks compared to majority voting and individual annotator modeling. Regarding the aggregation approach, the multi-label and multitask approaches are better for modeling clustered annotators than an ensemble and model majority vote.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper proposes an agreement-based clustering technique to group annotators by their label agreement patterns, thereby modeling diverse perspectives in subjective NLP tasks without modeling each annotator individually. It evaluates this approach on 40 datasets across 18 typologically diverse languages for sentiment analysis, emotion classification, and hate speech detection. Four aggregation strategies are compared—majority vote, ensemble, multi-label, and multitask—with results claiming that clustering yields significant performance gains over majority voting and per-annotator baselines, and that multi-label/multitask heads are preferable for clustered annotators.

Significance. If the central empirical claims hold after validation, the work would provide a scalable alternative to both majority voting (which discards perspective information) and full per-annotator modeling (which is resource-heavy). The large-scale, multilingual evaluation across three tasks strengthens potential applicability. Credit is due for the breadth of the experimental comparison and for testing multiple aggregation heads; however, the significance is tempered by the absence of direct evidence that the clusters capture stable, semantically meaningful perspectives rather than dataset artifacts.

major comments (4)
  1. §4 (Experimental Setup): No cluster-stability analysis is reported (e.g., adjusted Rand index or normalized mutual information across bootstrap resamples, different random seeds, or data splits). This is load-bearing for the claim that agreement-based clustering 'leverages the full spectrum of annotator perspectives,' because performance improvements could arise from fitting to noise or task-specific artifacts rather than genuine perspective groups.
  2. §5 (Results): The abstract and results claim to 'significantly enhance classification performance' relative to majority voting and individual modeling, yet no statistical significance tests (paired t-tests, McNemar’s test, or bootstrap confidence intervals) are provided for the reported gains. Without these, it is impossible to determine whether the improvements are robust or attributable to variance. (A minimal sketch of such a test appears below, after the minor comments.)
  3. §3 (Method) and §5.2 (Ablations): There is no ablation that isolates the contribution of the clustering step from the choice of multi-label or multitask aggregation heads. The performance advantage could therefore be driven primarily by the multi-head architectures rather than by the agreement-based grouping itself.
  4. §3 (Clustering details): The description of agreement-based clustering does not specify the distance metric used to compute annotator similarity or the procedure for selecting the number of clusters (e.g., elbow method, silhouette optimization, or fixed k). This omission affects reproducibility and raises questions about whether the method is as parameter-light as implied.
minor comments (2)
  1. [Abstract] The abstract names the four aggregation approaches but does not briefly characterize them; a one-clause gloss of each would aid immediate readability.
  2. [Figures in §5] Figure captions and axis labels in the results section could more clearly indicate which baseline each bar corresponds to, especially when multiple languages/tasks are plotted together.
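On major comment 2, the requested validation is mechanical once per-dataset scores are paired. A minimal sketch, assuming two aligned arrays of F1 scores over the 40 datasets; the function and its inputs are illustrative, not taken from the paper.

    # Illustrative paired test over per-dataset scores.
    import numpy as np
    from scipy import stats

    def paired_comparison(clustered_f1, majority_f1, n_boot=10_000, seed=0):
        clustered_f1 = np.asarray(clustered_f1)
        majority_f1 = np.asarray(majority_f1)
        diff = clustered_f1 - majority_f1
        t_stat, p_value = stats.ttest_rel(clustered_f1, majority_f1)
        # Bootstrap 95% CI on the mean per-dataset difference.
        rng = np.random.default_rng(seed)
        boots = rng.choice(diff, size=(n_boot, diff.size), replace=True).mean(axis=1)
        ci_low, ci_high = np.percentile(boots, [2.5, 97.5])
        return t_stat, p_value, (ci_low, ci_high)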

Simulated Authors' Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects for strengthening the empirical claims and reproducibility. We address each major comment point by point below and commit to revisions that directly incorporate the suggestions.

read point-by-point responses
  1. Referee: §4 (Experimental Setup): No cluster-stability analysis is reported (e.g., adjusted Rand index or normalized mutual information across bootstrap resamples, different random seeds, or data splits). This is load-bearing for the claim that agreement-based clustering 'leverages the full spectrum of annotator perspectives,' because performance improvements could arise from fitting to noise or task-specific artifacts rather than genuine perspective groups.

    Authors: We agree that cluster-stability analysis is necessary to support the claim that the clusters capture meaningful annotator perspectives. In the revised manuscript, we will add a new analysis in §4 reporting adjusted Rand index (ARI) and normalized mutual information (NMI) computed across bootstrap resamples, multiple random seeds, and different data splits. This will demonstrate that the agreement-based clusters are stable and not artifacts of noise or specific dataset splits. revision: yes

  2. Referee: §5 (Results): The abstract and results claim to 'significantly enhance classification performance' relative to majority voting and individual modeling, yet no statistical significance tests (paired t-tests, McNemar’s test, or bootstrap confidence intervals) are provided for the reported gains. Without these, it is impossible to determine whether the improvements are robust or attributable to variance.

    Authors: We appreciate this point and agree that statistical validation is required. We will update §5 (and the abstract where appropriate) to include paired t-tests across the 40 datasets and bootstrap confidence intervals for the performance differences. These tests will confirm whether the gains from agreement-based clustering over majority voting and per-annotator baselines are statistically significant. revision: yes

  3. Referee: §3 (Method) and §5.2 (Ablations): There is no ablation that isolates the contribution of the clustering step from the choice of multi-label or multitask aggregation heads. The performance advantage could therefore be driven primarily by the multi-head architectures rather than by the agreement-based grouping itself.

    Authors: We acknowledge the absence of a direct ablation isolating clustering from the aggregation heads. In the revised §5.2, we will add experiments comparing multi-label and multitask heads on clustered annotators versus the same heads applied without clustering (e.g., on the full set of annotators treated as one group). This will separate the effect of agreement-based grouping from the choice of multi-head architecture. revision: yes

  4. Referee: §3 (Clustering details): The description of agreement-based clustering does not specify the distance metric used to compute annotator similarity or the procedure for selecting the number of clusters (e.g., elbow method, silhouette optimization, or fixed k). This omission affects reproducibility and raises questions about whether the method is as parameter-light as implied.

    Authors: We will revise §3 to explicitly specify the distance metric (cosine similarity on annotator agreement vectors) and the cluster selection procedure (silhouette score optimization with a cap of k=10 to preserve efficiency). These details will be added to the method description to ensure full reproducibility. revision: yes
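Taken together, responses 1 and 4 amount to a checkable recipe. A hedged sketch of both pieces, reusing pairwise_agreement and cluster_annotators from the earlier sketch; the silhouette-capped selection of k and the ARI/NMI bootstrap follow the rebuttal's stated choices, which are themselves simulated rather than confirmed by a revised manuscript.

    # Illustrative cluster-count selection and stability analysis.
    import numpy as np
    from sklearn.metrics import (adjusted_rand_score,
                                 normalized_mutual_info_score,
                                 silhouette_score)

    def select_k(labels: np.ndarray, k_max: int = 10) -> int:
        """Silhouette-optimized k on the agreement distance, capped at k_max."""
        distance = 1.0 - pairwise_agreement(labels)
        best_k, best_score = 2, -1.0
        for k in range(2, k_max + 1):
            assignments = cluster_annotators(labels, k)
            score = silhouette_score(distance, assignments, metric="precomputed")
            if score > best_score:
                best_k, best_score = k, score
        return best_k

    def bootstrap_stability(labels: np.ndarray, k: int, n_resamples: int = 100):
        """Mean ARI/NMI between full-data clusters and item-resampled clusters."""
        rng = np.random.default_rng(0)
        reference = cluster_annotators(labels, k)
        ari, nmi = [], []
        for _ in range(n_resamples):
            items = rng.choice(labels.shape[1], size=labels.shape[1], replace=True)
            resampled = cluster_annotators(labels[:, items], k)
            ari.append(adjusted_rand_score(reference, resampled))
            nmi.append(normalized_mutual_info_score(reference, resampled))
        return float(np.mean(ari)), float(np.mean(nmi))

High mean ARI/NMI across resamples would support the load-bearing premise; values near zero would suggest the clusters are artifacts.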

Circularity Check

0 steps flagged

No circularity: purely empirical method comparison

full rationale

The paper proposes an agreement-based clustering method and evaluates it empirically across 40 datasets in 18 languages for three subjective tasks, comparing four aggregation strategies (majority vote, ensemble, multi-label, multitask) against baselines. No equations, derivations, or first-principles claims appear that reduce performance gains to quantities defined by the clustering step itself. Results are reported as observed experimental outcomes rather than forced by construction or self-citation chains. The central claim rests on dataset-specific performance numbers, which are externally falsifiable and not tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied empirical paper; it introduces no new mathematical axioms, free parameters beyond standard clustering hyperparameters, or invented entities.

pith-pipeline@v0.9.0 · 5522 in / 1052 out tokens · 57582 ms · 2026-05-12T04:19:32.094355+00:00 · methodology
