
arxiv: 1907.07228 · v1 · pith:DZLMHFAPnew · submitted 2019-07-16 · 💻 cs.SI · cs.LG

Modeling Human Annotation Errors to Design Bias-Aware Systems for Social Stream Processing

Pith reviewed 2026-05-24 20:21 UTC · model grok-4.3

classification 💻 cs.SI cs.LG
keywords human annotation errors · annotation schedule · active learning · social media analytics · bias mitigation · crisis classification · machine learning · real-time processing

The pith

The order in which social media posts are shown to annotators affects labeling quality and can be adjusted locally to improve machine learning accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that human annotation quality on social media data depends on the sequence of instances presented to annotators, called the annotation schedule. Small local changes to this ordering can reduce errors and produce more accurate labels for training machine learning systems. This matters for real-time analytics of social streams, especially during crises where timely classification is needed. The authors introduce an active learning algorithm that selects annotation schedules while remaining robust to certain human error patterns. Experiments on crisis post classification tasks confirm gains in classifier accuracy and greater awareness of annotation biases.

Core claim

Human annotation quality is dependent on the ordering of instances shown to annotators (the annotation schedule) and can be improved by local changes in the instance ordering, yielding a more accurate annotation of the data stream for efficient real-time social media analytics. An error-mitigating active learning algorithm is proposed that is robust with respect to some cases of human errors when deciding an annotation schedule. Validation through experiments on classification of relevant social media posts during crises shows increased machine learning accuracy and awareness of potential biases in human learning that may affect the automated classifier.

What carries the argument

The annotation schedule (the ordering of instances presented to annotators), optimized by an error-mitigating active learning algorithm that accounts for human error dependence on presentation order.
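To make the idea of a local schedule change concrete, here is a minimal sketch and not the authors' algorithm. It assumes a forgetting-style error model in which an annotator's reliability on an instance decays exponentially with the number of items seen since the last instance of the same group, in the spirit of the Ebbinghaus curve shown in Figure 1. The group key (for example, a predicted category), the `strength` parameter, and the function names are all hypothetical.

```python
import math
from typing import Dict, Hashable, List, Sequence


def retention(gap: int, strength: float = 5.0) -> float:
    """Exponential forgetting curve exp(-gap / strength).

    `strength` is a hypothetical memory-strength parameter, not a value
    taken from the paper.
    """
    return math.exp(-gap / strength)


def expected_retention(schedule: Sequence[Hashable], strength: float = 5.0) -> float:
    """Mean retention over items whose group already appeared earlier."""
    last_seen: Dict[Hashable, int] = {}
    scores: List[float] = []
    for pos, group in enumerate(schedule):
        if group in last_seen:
            scores.append(retention(pos - last_seen[group], strength))
        last_seen[group] = pos
    # With no repeated group there is nothing to forget yet.
    return sum(scores) / len(scores) if scores else 1.0


def improve_locally(schedule: Sequence[Hashable], strength: float = 5.0) -> List[Hashable]:
    """Greedy adjacent swaps, kept only when they raise expected retention."""
    best = list(schedule)
    best_score = expected_retention(best, strength)
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            candidate = best[:]
            candidate[i], candidate[i + 1] = candidate[i + 1], candidate[i]
            score = expected_retention(candidate, strength)
            if score > best_score + 1e-12:  # strict gain, so the loop terminates
                best, best_score = candidate, score
                improved = True
    return best
```

Under this toy objective the swaps simply pull items of the same group closer together. A real scheduler would have to trade that off against the informativeness of each instance for the classifier, which this sketch deliberately leaves out.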

If this is right

  • Machine learning classifiers for social media streams achieve higher accuracy when trained on labels produced under optimized schedules.
  • Automated systems gain awareness of biases that originate in human annotation processes and can propagate to models.
  • Real-time social media analytics during crises becomes more efficient due to higher-quality training data.
  • Active learning methods can remain effective under some kinds of human annotation error, provided those errors are accounted for when the schedule is chosen.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ordering effects could appear in annotation tasks outside crisis domains, such as general sentiment labeling or event detection.
  • Standard active learning pipelines may benefit from treating schedule optimization as a routine step rather than an optional add-on.
  • Interfaces for crowd annotation could incorporate dynamic reordering to limit systematic biases in collected datasets.

Load-bearing premise

The modeled dependence of human errors on annotation schedule is accurate enough to let the algorithm produce robust improvements, and results from crisis classification experiments will generalize to other settings.

What would settle it

A side-by-side test on social media crisis posts where an optimized annotation schedule produces no measurable gain in label accuracy or downstream classifier performance compared with a random schedule.
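One way such a side-by-side comparison could be scored is sketched below. It assumes, hypothetically, that the same posts are labeled under both schedules and that each label has been checked against a gold standard, giving a 0/1 correctness value per post and schedule. The function name, the paired design, and the percentile bootstrap are illustrative choices, not the paper's evaluation protocol.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_diff(
    correct_optimized: Sequence[int],
    correct_random: Sequence[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Accuracy difference (optimized minus random) with a 95% percentile interval.

    Inputs are per-post correctness flags (1 = label matches gold, 0 = not),
    aligned so that index i refers to the same post in both sequences.
    """
    if len(correct_optimized) != len(correct_random) or not correct_optimized:
        raise ValueError("inputs must be non-empty and of equal length")

    rng = random.Random(seed)
    n = len(correct_optimized)
    per_post = [a - b for a, b in zip(correct_optimized, correct_random)]
    observed = sum(per_post) / n

    # Resample posts with replacement, keeping each pair together.
    resampled: List[float] = []
    for _ in range(n_resamples):
        sample = [per_post[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()

    lower = resampled[int(0.025 * n_resamples)]
    upper = resampled[int(0.975 * n_resamples) - 1]
    return observed, lower, upper
```

An interval that comfortably includes zero would be the "no measurable gain" outcome described above. The same comparison would still need to be repeated for downstream classifier performance.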

Figures

Figures reproduced from arXiv: 1907.07228 by Carlos Castillo, Hemant Purohit, Rahul Pandey.

Figure 1: The Ebbinghaus Curve for forgetting behavior of humans, as described
Figure 2: AUC score of mitigation algorithms for hurricane datasets, showing superior performance of error-mitigating sampling in the case of forgetting errors.
Original abstract

High-quality human annotations are necessary to create effective machine learning systems for social media. Low-quality human annotations indirectly contribute to the creation of inaccurate or biased learning systems. We show that human annotation quality is dependent on the ordering of instances shown to annotators (referred as 'annotation schedule'), and can be improved by local changes in the instance ordering provided to the annotators, yielding a more accurate annotation of the data stream for efficient real-time social media analytics. We propose an error-mitigating active learning algorithm that is robust with respect to some cases of human errors when deciding an annotation schedule. We validate the human error model and evaluate the proposed algorithm against strong baselines by experimenting on classification tasks of relevant social media posts during crises. According to these experiments, considering the order in which data instances are presented to human annotators leads to both an increase in accuracy for machine learning and awareness toward some potential biases in human learning that may affect the automated classifier.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that human annotation quality in social media streams depends on the ordering of instances shown to annotators ('annotation schedule'), and that local changes to this ordering can improve annotation accuracy for real-time analytics. It proposes an error-mitigating active learning algorithm robust to certain human errors when selecting the schedule. The human error model and algorithm are validated via experiments on crisis-related social media post classification tasks, yielding higher machine learning accuracy and greater awareness of potential human biases.

Significance. If validated, the work could meaningfully improve labeled data quality for streaming social media ML systems, especially in time-sensitive domains like crisis informatics, by treating annotation order as a controllable variable rather than assuming uniform annotator performance. The empirical focus on crisis classification provides a practical test case; reproducible code or parameter-free derivations would further strengthen its contribution.

minor comments (2)
  1. [Abstract] The abstract refers to validation 'against strong baselines' and 'increase in accuracy' but provides no quantitative details, dataset sizes, or statistical tests; these should be summarized with effect sizes in the abstract or §Experiments.
  2. [Algorithm / Experiments] The claim that the algorithm is 'robust with respect to some cases of human errors' would benefit from an explicit enumeration of the error cases considered and the conditions under which robustness holds, ideally with a dedicated subsection or table.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of the manuscript and for recognizing its potential significance in improving labeled data quality for streaming social media ML systems by treating annotation order as a controllable variable. The recommendation of 'uncertain' is noted. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central claims rest on an empirical model of annotation errors derived from experiments on crisis post classification tasks, together with validation of an active learning algorithm against baselines. No equations, fitted parameters, or self-citations are presented in the abstract or described claims that reduce the target result to a definition or input by construction. The derivation chain is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based only on the abstract, no specific free parameters, axioms, or invented entities are identifiable. The contribution appears to be an empirical study and algorithm proposal without explicit mathematical derivations or new postulated entities.

pith-pipeline@v0.9.0 · 5694 in / 1227 out tokens · 32664 ms · 2026-05-24T20:21:18.003214+00:00 · methodology


Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 1 internal anchor

  1. [1]

    Engineering Crowdsourced Stream Processing Systems

    M. Imran, I. Lykourentzou, Y. Naudet, and C. Castillo, “Engineering crowdsourced stream processing systems,” arXiv:1310.5463, 2013

  2. [2]

    Design patterns for hybrid algorithmic-crowdsourcing workflows

    C. Lofi and K. El Maarry, “Design patterns for hybrid algorithmic-crowdsourcing workflows,” in IEEE Business Informatics, 2014, pp. 1–8

  3. [3]

    Memory: A contribution to experimental psychology

    H. Ebbinghaus, “Memory: A contribution to experimental psychology,” Annals of neurosciences, vol. 20, no. 4, p. 155, 2013

  4. [4]

    Quantifying the impact of cognitive biases in question-answering systems

    K. Burghardt, T. Hogg, and K. Lerman, “Quantifying the impact of cognitive biases in question-answering systems,” in AAAI ICWSM’18, 2018, pp. 568–571

  5. [5]

    Experiences surveying the crowd: Reflections on methods, participation, and reliability

    C. C. Marshall and F. M. Shipman, “Experiences surveying the crowd: Reflections on methods, participation, and reliability,” in ACM WebSci’13, 2013, pp. 234–243

  6. [6]

    Ranking of social media alerts with workload bounds in emergency operation centers

    H. Purohit, C. Castillo, M. Imran, and R. Pandey, “Ranking of social media alerts with workload bounds in emergency operation centers,” in IEEE/WIC/ACM WebIntelligence’18. IEEE, 2018, pp. 206–213

  7. [7]

    Human error

    J. Reason, Human error. Cambridge university press, 1990

  8. [8]

    A cognitive taxonomy of medical errors

    J. Zhang, V. L. Patel, T. R. Johnson, and E. H. Shortliffe, “A cognitive taxonomy of medical errors,” Journal of biomedical informatics, vol. 37, no. 3, pp. 193–204, 2004

  9. [9]

    A survey on concept drift adaptation

    J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, “A survey on concept drift adaptation,” ACM computing surveys (CSUR), vol. 46, no. 4, p. 44, 2014

  10. [10]

    Adapting dynamic classifier selection for concept drift

    P. R. Almeida, L. S. Oliveira, A. S. Britto Jr, and R. Sabourin, “Adapting dynamic classifier selection for concept drift,” Expert Systems with Applications, vol. 104, pp. 67–85, 2018

  11. [11]

    Active learning with drifting streaming data

    I. Žliobaitė, A. Bifet, B. Pfahringer, and G. Holmes, “Active learning with drifting streaming data,” IEEE transactions on neural networks and learning systems, vol. 25, no. 1, pp. 27–39, 2014

  12. [12]

    Crisismmd: Multimodal twitter datasets from natural disasters

    F. Alam, F. Ofli, and M. Imran, “Crisismmd: Multimodal twitter datasets from natural disasters,” in AAAI ICWSM’18, 2018, pp. 465–473