Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

Christian Greisinger; Christoph Leiter; Gagan Bhatia; Lisa Sophie Albertelli; Lotta Kiefer; Maria Kunilovskaya; Muhammad Arslan Manzoor; Sebastian Pohl; Steffen Eger; Subhadeep Roy

arxiv: 2606.02255 · v1 · pith:F5FM6QODnew · submitted 2026-06-01 · 💻 cs.CL · cs.AI

Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

Maria Kunilovskaya , Gagan Bhatia , Lisa Sophie Albertelli , Yanran Chen , Christian Greisinger , Lotta Kiefer , Christoph Leiter , Subhadeep Roy

show 5 more authors

Tewodros Achamaleh Muhammad Arslan Manzoor Sebastian Pohl Yufang Hou Steffen Eger

This is my paper

Pith reviewed 2026-06-28 14:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords human annotationNLP reporting practicesdataset constructionmodel evaluationreproducibilityannotation qualityLLM extraction

0 comments

The pith

NLP papers often report how annotations were collected but omit details needed to judge their quality and validity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits documentation of human annotation across ACL papers from 2018 to 2025. It finds operational facts such as recruitment methods, annotator expertise, and task volume are usually stated, while information required to assess validity—training procedures, language skills, pay, demographics, adjudication steps, and agreement scores—is frequently left out. These omissions appear more often in studies that use annotations to evaluate models. Reporting has become more complete over the years, yet remains inconsistent by venue and task type. The authors supply a taxonomy of reporting items plus a validated LLM pipeline that extracts the practices at scale from thousands of papers.

Core claim

Across 1,603 papers containing 2,667 annotation tasks, papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies; reporting has improved over time but remains uneven.

What carries the argument

A unified taxonomy of annotation-reporting practices together with an LLM-assisted extraction pipeline that reaches Krippendorff's alpha of 0.606 on a human-adjudicated gold standard of 41 papers.

If this is right

Model-evaluation studies require extra scrutiny on agreement and adjudication reporting.
The taxonomy can serve as a checklist for authors and reviewers.
The Annotated-llm dataset enables tracking of reporting changes over time.
Bare-minimum reporting standards derived from the taxonomy would improve reproducibility of annotation-based results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread adoption of the taxonomy could reduce hidden variability in benchmark quality.
The same extraction approach could be applied to non-NLP fields that rely on human labels.
Journals could require the taxonomy items as a submission checklist to close the observed gaps.
Continued monitoring with the pipeline would show whether community guidelines are actually changing practice.

Load-bearing premise

The LLM pipeline extracts and classifies reporting practices accurately enough across the full collection of papers to support the reported patterns of presence and absence.

What would settle it

A fresh manual review of several hundred randomly sampled papers that shows the pipeline systematically misclassifies the presence or absence of validity-related items such as training or agreement scores.

Figures

Figures reproduced from arXiv: 2606.02255 by Christian Greisinger, Christoph Leiter, Gagan Bhatia, Lisa Sophie Albertelli, Lotta Kiefer, Maria Kunilovskaya, Muhammad Arslan Manzoor, Sebastian Pohl, Steffen Eger, Subhadeep Roy, Tewodros Achamaleh, Yanran Chen, Yufang Hou.

**Figure 1.** Figure 1: Task-level taxonomy for annotationreporting analysis. The taxonomy groups 25 reporting categories into seven aspects: general task description, agreement, workload, recruitment and qualifications, compensation, socio-demographics, and quality control. tasks required informed judgment. The final taxonomy captures seven aspects of human annotation reporting across 25 categories. The full description of cat… view at source ↗

**Figure 2.** Figure 2: Reporting patterns across annotation attributes. Panel A shows the overall percentage of annotation tasks for which each attribute is reported. Panel B shows the corresponding reporting rates by NLP topic. Darker green cells indicate higher reporting rates within a topic; orange and red cells indicate lower reporting rates. and authors are less often used as annotators (Figures 12 and 7, Appendix F). Thes… view at source ↗

**Figure 3.** Figure 3: Interrupted time-series analysis of REPORTAGE SCORE before and after the ACL Responsible NLP Checklist. Points show observed yearly mean REPORTAGE SCORE values, error bars indicate standard errors, the solid line shows the fitted trend, and the dashed line shows the counterfactual continuation of the pre-2022 trend. The vertical dashed line marks the checklist introduction. of annotation studies, potenti… view at source ↗

**Figure 4.** Figure 4: Mean REPORTAGE SCORE by intended use of human annotation. Lines show yearly mean REPORTAGE SCORE values for all annotation tasks, model-output evaluation tasks, and resource-creation tasks; error bars indicate standard errors. tended use labels from our taxonomy to compare annotation tasks whose primary purpose is resource creation, model output evaluation, or humanperformance benchmarking. These labels… view at source ↗

**Figure 5.** Figure 5: Significant proportional differences (∆%) between filtered and random samples across category–value pairs (Chi-square test, p < 0.05). Bars are sorted by effect size and centered at zero to indicate direction (over- vs under-representation in the filtered sample). For binary and binarized string fields, only the significant proportional deltas for the positive value (e.g., “yes”, “reported”) are shown to … view at source ↗

**Figure 8.** Figure 8: Distribution of quality control measures in [PITH_FULL_IMAGE:figures/full_fig_p029_8.png] view at source ↗

**Figure 6.** Figure 6: Trends in reporting quality for socio-linguistic [PITH_FULL_IMAGE:figures/full_fig_p029_6.png] view at source ↗

**Figure 7.** Figure 7: Language proficiency reporting rates, with [PITH_FULL_IMAGE:figures/full_fig_p029_7.png] view at source ↗

**Figure 10.** Figure 10: Distribution of annotator expertise in so [PITH_FULL_IMAGE:figures/full_fig_p030_10.png] view at source ↗

**Figure 13.** Figure 13: Interrupted time series analysis of reportage [PITH_FULL_IMAGE:figures/full_fig_p030_13.png] view at source ↗

**Figure 11.** Figure 11: Distribution of adjudication strategies in [PITH_FULL_IMAGE:figures/full_fig_p030_11.png] view at source ↗

**Figure 14.** Figure 14: Logistic regression of QC usage by anno [PITH_FULL_IMAGE:figures/full_fig_p031_14.png] view at source ↗

**Figure 15.** Figure 15: Example excerpt from Haber et al. (2023) illustrating comprehensive reporting of annotator-related information, including the number of annotators, recruitment procedures, and annotators’ nation of origin [PITH_FULL_IMAGE:figures/full_fig_p032_15.png] view at source ↗

**Figure 16.** Figure 16: Example excerpt from Haber et al. (2023) providing extensive reporting of annotation methodology and annotator characteristics, including annotators’ nation of origin, language proficiency, gender and age, as well as total number of annotated items and number of annotators per item [PITH_FULL_IMAGE:figures/full_fig_p032_16.png] view at source ↗

**Figure 17.** Figure 17: Example excerpt from Haber et al. (2023) providing detailed reporting of inter-annotator agreement (IAA) metrics, scores, and disagreement resolution procedures. 32 [PITH_FULL_IMAGE:figures/full_fig_p032_17.png] view at source ↗

**Figure 18.** Figure 18: The NLLG lab and NLP@IT:U members in Speinshart in February, 2026. [PITH_FULL_IMAGE:figures/full_fig_p035_18.png] view at source ↗

read the original abstract

Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks, where the best model reaches human-comparable agreement with adjudicated labels, with Krippendorff's alpha of 0.606 versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset covering ACL-venue papers from 2018-2025, with 2,667 extracted annotation tasks from 1,603 papers, and find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven, and they establish a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Audit shows NLP papers often skip validity details on annotations but the LLM pipeline's moderate agreement leaves the exact rates uncertain.

read the letter

The main takeaway is that this audit documents widespread gaps in how NLP papers describe their human annotations, with validity-related details like compensation, demographics, and agreement often missing, particularly in model evaluation settings, while reporting has gotten a bit better since 2018.

They created a taxonomy for these details and built an LLM-based extractor. On a small set of 41 papers with human labels, the best model got Krippendorff's alpha of 0.606, which is close to the 0.585 between two humans. They then applied it to 1,603 papers yielding 2,667 tasks from 2018 to 2025.

The new part is the scope and the unified framework. It tracks changes over time and differences across venues and use cases. The results confirm earlier smaller observations but at a much larger scale, and the recommendations for bare-minimum reporting are concrete.

One soft spot is the lack of detailed error analysis for the pipeline. Moderate agreement without knowing which categories the model struggles with makes the headline omission rates less certain. Systematic mistakes in detecting omitted fields would directly impact the conclusions about validity details.

This work is aimed at the NLP community concerned with data quality and reproducibility. It is the kind of empirical study that can push for better standards. It deserves peer review because the methods are described and the data is large enough to be informative, even with the validation limits noted.

Referee Report

2 major / 2 minor

Summary. The paper introduces a unified taxonomy of annotation-reporting practices and uses an LLM-assisted extraction pipeline (validated on a 41-paper/72-task human-adjudicated gold standard reaching Krippendorff's α=0.606, comparable to human-human α=0.585) to audit reporting across 1,603 ACL-venue papers (2,667 tasks) from 2018–2025. It claims that operational details (recruitment, expertise, volume) are frequently reported while validity-related details (training, language proficiency, compensation, socio-demographics, adjudication, agreement) are often omitted—especially in model-evaluation studies—and that reporting has improved over time but remains uneven, offering a scalable framework and minimum reporting recommendations.

Significance. If the extraction pipeline proves reliable, the work supplies the first large-scale empirical map of annotation-reporting gaps in NLP, reusable datasets (Annotated-gold, Annotated-llm), and a taxonomy that could support reproducibility standards. The explicit comparison to human agreement and the task-level granularity are strengths.

major comments (2)

[validation results on Annotated-gold] Validation experiment (41-paper gold standard): Krippendorff's α=0.606 is reported as human-comparable, yet no per-category agreement scores, confusion matrices, or error rates are provided for individual reporting dimensions (e.g., compensation, socio-demographics, adjudication). Systematic LLM bias on any single dimension would directly distort the headline omission frequencies and the “improved over time” trend across the 2,667 tasks.
[methods and results on full corpus extraction] Pipeline application to full corpus (§ on Annotated-llm construction): The central observational claims rest on automated classification of 1,603 papers without reported held-out validation, inter-annotator agreement on a larger sample, or sensitivity analysis for extraction errors; the moderate α on the small gold set therefore leaves the quantitative results (e.g., differential omission in model-evaluation vs. dataset-construction tasks) on uncertain footing.

minor comments (2)

[task extraction definition] Clarify how “annotation task” boundaries are defined when a single paper contains multiple distinct annotation efforts; this affects the 2,667-task count and cross-task comparisons.
[taxonomy section] The taxonomy could be presented with an explicit mapping table to the extracted fields to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful review and constructive comments on our validation approach and the reliability of the extraction pipeline. We provide point-by-point responses below.

read point-by-point responses

Referee: [validation results on Annotated-gold] Validation experiment (41-paper gold standard): Krippendorff's α=0.606 is reported as human-comparable, yet no per-category agreement scores, confusion matrices, or error rates are provided for individual reporting dimensions (e.g., compensation, socio-demographics, adjudication). Systematic LLM bias on any single dimension would directly distort the headline omission frequencies and the “improved over time” trend across the 2,667 tasks.

Authors: We agree that aggregate agreement alone does not fully address potential dimension-specific biases. In the revised manuscript, we will add per-category Krippendorff's α scores for each of the reporting dimensions in the taxonomy, as well as selected confusion matrices and error analysis for key categories such as compensation and socio-demographics. This will allow better assessment of the pipeline's reliability on individual dimensions. revision: yes
Referee: [methods and results on full corpus extraction] Pipeline application to full corpus (§ on Annotated-llm construction): The central observational claims rest on automated classification of 1,603 papers without reported held-out validation, inter-annotator agreement on a larger sample, or sensitivity analysis for extraction errors; the moderate α on the small gold set therefore leaves the quantitative results (e.g., differential omission in model-evaluation vs. dataset-construction tasks) on uncertain footing.

Authors: The gold standard of 41 papers was constructed as a held-out validation set for the pipeline, and we selected it to be representative across venues, years, and task types. We acknowledge that the sample size is modest and that additional human validation on a larger subset of the corpus would be ideal. Due to the high cost of human annotation, we did not perform this. However, we will include a sensitivity analysis in the revision by simulating extraction errors and assessing the robustness of the reported trends, such as the improvement over time and differences by task type. We believe this, combined with the human-comparable aggregate agreement, provides reasonable support for the claims. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical extraction from external papers with external human validation

full rationale

This is a descriptive content-analysis study that applies an LLM pipeline to extract reporting details from 1,603 external ACL papers (2018-2025) and validates the pipeline against a separately human-adjudicated gold standard of 41 papers. No mathematical derivations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the chain from data extraction to reported frequencies and trends. The central results are direct tallies of omitted vs. reported fields across the extracted tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claims rest on the assumption that the new taxonomy comprehensively captures relevant reporting dimensions and that the LLM pipeline produces reliable extractions at scale, with only partial validation on a small gold-standard set of 41 papers.

axioms (1)

domain assumption The unified taxonomy of annotation-reporting practices is comprehensive and appropriate for auditing NLP papers.
The entire extraction and analysis framework is built around this taxonomy.

invented entities (2)

unified taxonomy of annotation-reporting practices no independent evidence
purpose: To provide a standardized framework for extracting and categorizing annotation details
Newly introduced in the paper to enable the audit.
Annotated-llm dataset no independent evidence
purpose: Large-scale collection of extracted annotation tasks from ACL papers
Constructed via the LLM pipeline for the reported analysis.

pith-pipeline@v0.9.1-grok · 5862 in / 1232 out tokens · 36316 ms · 2026-06-28T14:58:00.520507+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

28 extracted references · 6 canonical work pages · 1 internal anchor

[1]

Language Models are Few-Shot Learners

WiC-TSV: An evaluation benchmark for target sense verification of words in context. InProceed- ings of the 16th Conference of the European Chap- ter of the Association for Computational Linguistics: Main V olume, pages 1635–1645, Online. Association for Computational Linguistics. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Praf...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[2]

Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation

NeuroTrialNER: An annotated corpus for neu- rological diseases and therapies in clinical trial reg- istries. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18868–18890, Miami, Florida, USA. Associa- tion for Computational Linguistics. 10 Steffen Eger, Yong Cao, Jennifer D’Souza, Andreas Geiger, Christian Gr...

work page arXiv 2024
[3]

Preprint, arXiv:2505.24712

Heseia: A community-based dataset for eval- uating social biases in large language models, co- designed in real school settings in latin america. Preprint, arXiv:2505.24712. Aiqi Jiang, Nikolas Vitsakis, Tanvi Dinkar, Gavin Aber- crombie, and Ioannis Konstas. 2024. Re-examining sexism and misogyny classification with annotator attitudes. InFindings of the...

work page arXiv 2024
[4]

11 Antonia Karamolegkou, Sandrine Schiller Hansen, Ariadni Christopoulou, Filippos Stamatiou, Anne Lauscher, and Anders Søgaard

Human-in-the-loop synthetic text data in- spection with provenance tracking.Preprint, arXiv:2404.18881. 11 Antonia Karamolegkou, Sandrine Schiller Hansen, Ariadni Christopoulou, Filippos Stamatiou, Anne Lauscher, and Anders Søgaard. 2025. Ethical con- cern identification in NLP: A corpus of ACL Anthol- ogy ethics statements. InProceedings of the 2025 Conf...

work page arXiv 2025
[5]

Keith Tyser, Ben Segev, Gaston Longhitano, Xin-Yu Zhang, Zachary Meeks, Jason Lee, Uday Garg, Nicholas Belsten, Avi Shporer, Madeleine Udell, Dov Te’eni, and Iddo Drori

Visbias: Measuring explicit and implicit so- cial biases in vision language models.Preprint, arXiv:2503.07575. Keith Tyser, Ben Segev, Gaston Longhitano, Xin-Yu Zhang, Zachary Meeks, Jason Lee, Uday Garg, Nicholas Belsten, Avi Shporer, Madeleine Udell, Dov Te’eni, and Iddo Drori. 2024. Ai-driven review sys- tems: Evaluating llms in scalable and bias-aware...

work page arXiv 2024
[6]

yes”, “reported

Bridging ai and science: Implications from a large-scale literature analysis of ai4science.Preprint, arXiv:2412.09628. Limao Xiong, Jie Zhou, Qunxi Zhu, Xiao Wang, Yuan- bin Wu, Qi Zhang, Tao Gui, Xuanjing Huang, Jin Ma, and Ying Shan. 2023. A confidence-based partial la- bel learning model for crowd-annotated named entity recognition. InFindings of the A...

work page arXiv 2023
[7]

hate speech detection vs

**Different annotation objective** -- distinct label spaces or annotation goals (e.g. hate speech detection vs. stance detection)
[8]

experts for one subtask, crowd for another)

**Different annotator pools** -- different groups used for different tasks (e.g. experts for one subtask, crowd for another)
[9]

**Different recruitment / compensation** -- one task paid, another unpaid; or recruited via different channels
[10]

**Different annotation units** -- sentence-level vs document-level; word-level vs sentence-level
[11]

A paper that runs hate speech labeling and then target-type labeling on the same dataset has TWO experiments, not one

**Separate dedicated annotation rounds** -- the paper describes multiple distinct rounds, each with its own guidelines, annotators, or IAA Real-paper SPLIT examples: * binary attack classification / language classification / attack target classification -> 3 experiments (different label spaces) * summarization generation / summarization rating -> 2 experi...
[12]

Does the paper describe any OTHER annotation task (different labels, different annotators, different purpose in a different paper section)?
[13]

Experiment 2

Are there separate sections titled "Experiment 2", "Human Evaluation", "Quality Annotation", "Pilot Study"?
[14]

Did any OTHER group of humans label, rate, or judge data in this paper?
[15]

### CRITICAL: Avoid These Common Mistakes

Is there a text production task AND a separate evaluation/rating task performed by humans? If the answer to ANY of these is yes -> add another experiment. ### CRITICAL: Avoid These Common Mistakes
[16]

Not reported

**Do NOT use "Not reported" for demographic fields** -- use "No". These fields ONLY accept "Yes" or "No"
[17]

Majority vote

**disagreement_resolution_adjudication must be one of the specific methods** -- "Majority vote", "Expert adjudication", "Third annotator", "Consensus discussion", "Soft labels / Distributional labels", "Weighted voting", "Other / mixed", or "None"
[18]

**Do NOT infer language proficiency** from the dataset language or annotator country alone (exception: ethnicity matching task language -> "Native")
[19]

**Do NOT infer compensation amounts** -- but DO infer "Free" when authors annotate their own data, or mention voluntary effort
[20]

**Do NOT assume guidelines are released** just because the paper describes the task in prose -- there must be a dedicated document, appendix, URL, or cited source
[21]

**Do NOT calculate workload fields** from other numbers -- only extract explicitly stated values
[22]

Crowdsourcing

**For recruitment**, use the broad categories: "Crowdsourcing", "Authors", "Other", "Mixed", "Not reported" -- do NOT use platform-specific names
[23]

NA"** -- use

**For compensation, do NOT use "NA"** -- use "Not reported" if compensation is unmentioned
[24]

General Language/Knowledge task

**Do NOT confuse expertise_level "General Language/Knowledge task" and "Not reported"**: "General 26 Language/Knowledge task" = task needs no specialization at all; "Not reported" = task could need specialization but paper doesn't specify annotators'expertise
[25]

Do not output "NA" for a real metric until you have checked all tables

**For IAA values, check tables** -- numeric values are very often in tables, not the prose. Do not output "NA" for a real metric until you have checked all tables
[26]

**For total_annotated_items, use the annotated subset size**, not the total corpus/dataset size -- these can differ substantially
[27]

sentiment rating

**subtask_name must name the ANNOTATION task**, not the downstream NLP modelling task -- "sentiment rating" not "sentiment classifier"; "NER tagging" not "NER model"
[28]

Not reported

**For every numerical value, record evidence** in the `numerical_evidence`object -- do NOT leave it empty when you extracted a non-"Not reported" value. ### Field Interdependencies -- APPLY THESE BEFORE FINALISING OUTPUT These conditional rules take PRECEDENCE over all other guidance. Check each one after filling every field. **A. iaa_value is fully deter...

2018

[1] [1]

Language Models are Few-Shot Learners

WiC-TSV: An evaluation benchmark for target sense verification of words in context. InProceed- ings of the 16th Conference of the European Chap- ter of the Association for Computational Linguistics: Main V olume, pages 1635–1645, Online. Association for Computational Linguistics. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Praf...

work page internal anchor Pith review Pith/arXiv arXiv 2020

[2] [2]

Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation

NeuroTrialNER: An annotated corpus for neu- rological diseases and therapies in clinical trial reg- istries. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18868–18890, Miami, Florida, USA. Associa- tion for Computational Linguistics. 10 Steffen Eger, Yong Cao, Jennifer D’Souza, Andreas Geiger, Christian Gr...

work page arXiv 2024

[3] [3]

Preprint, arXiv:2505.24712

Heseia: A community-based dataset for eval- uating social biases in large language models, co- designed in real school settings in latin america. Preprint, arXiv:2505.24712. Aiqi Jiang, Nikolas Vitsakis, Tanvi Dinkar, Gavin Aber- crombie, and Ioannis Konstas. 2024. Re-examining sexism and misogyny classification with annotator attitudes. InFindings of the...

work page arXiv 2024

[4] [4]

11 Antonia Karamolegkou, Sandrine Schiller Hansen, Ariadni Christopoulou, Filippos Stamatiou, Anne Lauscher, and Anders Søgaard

Human-in-the-loop synthetic text data in- spection with provenance tracking.Preprint, arXiv:2404.18881. 11 Antonia Karamolegkou, Sandrine Schiller Hansen, Ariadni Christopoulou, Filippos Stamatiou, Anne Lauscher, and Anders Søgaard. 2025. Ethical con- cern identification in NLP: A corpus of ACL Anthol- ogy ethics statements. InProceedings of the 2025 Conf...

work page arXiv 2025

[5] [5]

Keith Tyser, Ben Segev, Gaston Longhitano, Xin-Yu Zhang, Zachary Meeks, Jason Lee, Uday Garg, Nicholas Belsten, Avi Shporer, Madeleine Udell, Dov Te’eni, and Iddo Drori

Visbias: Measuring explicit and implicit so- cial biases in vision language models.Preprint, arXiv:2503.07575. Keith Tyser, Ben Segev, Gaston Longhitano, Xin-Yu Zhang, Zachary Meeks, Jason Lee, Uday Garg, Nicholas Belsten, Avi Shporer, Madeleine Udell, Dov Te’eni, and Iddo Drori. 2024. Ai-driven review sys- tems: Evaluating llms in scalable and bias-aware...

work page arXiv 2024

[6] [6]

yes”, “reported

Bridging ai and science: Implications from a large-scale literature analysis of ai4science.Preprint, arXiv:2412.09628. Limao Xiong, Jie Zhou, Qunxi Zhu, Xiao Wang, Yuan- bin Wu, Qi Zhang, Tao Gui, Xuanjing Huang, Jin Ma, and Ying Shan. 2023. A confidence-based partial la- bel learning model for crowd-annotated named entity recognition. InFindings of the A...

work page arXiv 2023

[7] [7]

hate speech detection vs

**Different annotation objective** -- distinct label spaces or annotation goals (e.g. hate speech detection vs. stance detection)

[8] [8]

experts for one subtask, crowd for another)

**Different annotator pools** -- different groups used for different tasks (e.g. experts for one subtask, crowd for another)

[9] [9]

**Different recruitment / compensation** -- one task paid, another unpaid; or recruited via different channels

[10] [10]

**Different annotation units** -- sentence-level vs document-level; word-level vs sentence-level

[11] [11]

A paper that runs hate speech labeling and then target-type labeling on the same dataset has TWO experiments, not one

**Separate dedicated annotation rounds** -- the paper describes multiple distinct rounds, each with its own guidelines, annotators, or IAA Real-paper SPLIT examples: * binary attack classification / language classification / attack target classification -> 3 experiments (different label spaces) * summarization generation / summarization rating -> 2 experi...

[12] [12]

Does the paper describe any OTHER annotation task (different labels, different annotators, different purpose in a different paper section)?

[13] [13]

Experiment 2

Are there separate sections titled "Experiment 2", "Human Evaluation", "Quality Annotation", "Pilot Study"?

[14] [14]

Did any OTHER group of humans label, rate, or judge data in this paper?

[15] [15]

### CRITICAL: Avoid These Common Mistakes

Is there a text production task AND a separate evaluation/rating task performed by humans? If the answer to ANY of these is yes -> add another experiment. ### CRITICAL: Avoid These Common Mistakes

[16] [16]

Not reported

**Do NOT use "Not reported" for demographic fields** -- use "No". These fields ONLY accept "Yes" or "No"

[17] [17]

Majority vote

**disagreement_resolution_adjudication must be one of the specific methods** -- "Majority vote", "Expert adjudication", "Third annotator", "Consensus discussion", "Soft labels / Distributional labels", "Weighted voting", "Other / mixed", or "None"

[18] [18]

**Do NOT infer language proficiency** from the dataset language or annotator country alone (exception: ethnicity matching task language -> "Native")

[19] [19]

**Do NOT infer compensation amounts** -- but DO infer "Free" when authors annotate their own data, or mention voluntary effort

[20] [20]

**Do NOT assume guidelines are released** just because the paper describes the task in prose -- there must be a dedicated document, appendix, URL, or cited source

[21] [21]

**Do NOT calculate workload fields** from other numbers -- only extract explicitly stated values

[22] [22]

Crowdsourcing

**For recruitment**, use the broad categories: "Crowdsourcing", "Authors", "Other", "Mixed", "Not reported" -- do NOT use platform-specific names

[23] [23]

NA"** -- use

**For compensation, do NOT use "NA"** -- use "Not reported" if compensation is unmentioned

[24] [24]

General Language/Knowledge task

**Do NOT confuse expertise_level "General Language/Knowledge task" and "Not reported"**: "General 26 Language/Knowledge task" = task needs no specialization at all; "Not reported" = task could need specialization but paper doesn't specify annotators'expertise

[25] [25]

Do not output "NA" for a real metric until you have checked all tables

**For IAA values, check tables** -- numeric values are very often in tables, not the prose. Do not output "NA" for a real metric until you have checked all tables

[26] [26]

**For total_annotated_items, use the annotated subset size**, not the total corpus/dataset size -- these can differ substantially

[27] [27]

sentiment rating

**subtask_name must name the ANNOTATION task**, not the downstream NLP modelling task -- "sentiment rating" not "sentiment classifier"; "NER tagging" not "NER model"

[28] [28]

Not reported

**For every numerical value, record evidence** in the `numerical_evidence`object -- do NOT leave it empty when you extracted a non-"Not reported" value. ### Field Interdependencies -- APPLY THESE BEFORE FINALISING OUTPUT These conditional rules take PRECEDENCE over all other guidance. Check each one after filling every field. **A. iaa_value is fully deter...

2018