arXiv:1608.07836 [cs]

Plank, Barbara · 2016 · cs.CL · arXiv 1608.07836

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Real world data differs radically from the benchmark corpora we use in natural language processing (NLP). As soon as we apply our technologies to the real world, performance drops. The reason for this problem is obvious: NLP models are trained on samples from a limited set of canonical varieties that are considered standard, most prominently English newswire. However, there are many dimensions, e.g., socio-demographics, language, genre, sentence type, etc. on which texts can differ from the standard. The solution is not obvious: we cannot control for all factors, and it is not clear how to best go beyond the current practice of training on homogeneous data from a single domain and language. In this paper, I review the notion of canonicity, and how it shapes our community's approach to language. I argue for leveraging what I call fortuitous data, i.e., non-obvious data that is hitherto neglected, hidden in plain sight, or raw data that needs to be refined. If we embrace the variety of this heterogeneous data by combining it with proper algorithms, we will not only produce more robust models, but will also enable adaptive language technology capable of addressing natural language variation.

representative citing papers

Parser agreement and disagreement in L2 Korean UD: Implications for human-in-the-loop annotation

cs.CL · 2026-05-07 · unverdicted · novelty 5.0

Parser agreement between two adapted models serves as a reliable proxy for human correctness in L2 Korean UD annotation, with disagreements clustering in predictable linguistic areas like grammatical relations and clause boundaries.

Task Decomposition for Efficient Annotation

cs.CL · 2026-06-23 · unverdicted · novelty 4.0

Decomposing annotation tasks using centers from centering theory reduces aggregate inferential load via a degrees-of-freedom model and enables better sub-task allocation.

citing papers explorer

Parser agreement and disagreement in L2 Korean UD: Implications for human-in-the-loop annotation

cs.CL · 2026-05-07 · unverdicted · none · ref 35

Task Decomposition for Efficient Annotation

cs.CL · 2026-06-23 · unverdicted · none · ref 88 · internal anchor
