From 0 to 10 million annotated words: part-of-speech tagging for Middle High German

Language Resources and Evaluation, 53 (4): 837-863 (2019)

Abstract

By building a part-of-speech (POS) tagger for Middle High German (MHG), we investigate strategies for dealing with a low-resource, diverse and non-standard language in natural language processing. We examine the quantity of data needed for training and the influence of data quality on tagger performance. Since the lack of annotated resources poses a problem for training a tagger, we show how existing resources can be adapted to serve as additional training data. The resulting POS model achieves a tagging accuracy of about 91% on a diverse test set representing the different genres, time periods and varieties of MHG.
