Abstract: Recent initiatives by cultural heritage institutions in addressing outdated and offensive language used in their collections demonstrate the need for further understanding into when terms are problematic or contentious. This paper presents an annotated dataset of 2,715 unique samples of terms in context, drawn from a historical newspaper archive, collating 21,800 annotations of contentiousness from expert and crowd workers. We describe the contents of the corpus by analysing inter-rater agreement and differences between experts and crowd worker...
(read more)
Topics: 
Artificial intelligence
Natural language processing