pith. sign in

arXiv:1608.07836 [cs] , year =

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it
abstract

Real world data differs radically from the benchmark corpora we use in natural language processing (NLP). As soon as we apply our technologies to the real world, performance drops. The reason for this problem is obvious: NLP models are trained on samples from a limited set of canonical varieties that are considered standard, most prominently English newswire. However, there are many dimensions, e.g., socio-demographics, language, genre, sentence type, etc. on which texts can differ from the standard. The solution is not obvious: we cannot control for all factors, and it is not clear how to best go beyond the current practice of training on homogeneous data from a single domain and language. In this paper, I review the notion of canonicity, and how it shapes our community's approach to language. I argue for leveraging what I call fortuitous data, i.e., non-obvious data that is hitherto neglected, hidden in plain sight, or raw data that needs to be refined. If we embrace the variety of this heterogeneous data by combining it with proper algorithms, we will not only produce more robust models, but will also enable adaptive language technology capable of addressing natural language variation.

fields

cs.CL 2

years

2026 2

verdicts

UNVERDICTED 2

representative citing papers

Task Decomposition for Efficient Annotation

cs.CL · 2026-06-23 · unverdicted · novelty 4.0

Decomposing annotation tasks using centers from centering theory reduces aggregate inferential load via a degrees-of-freedom model and enables better sub-task allocation.

citing papers explorer

Showing 2 of 2 citing papers.