From 0 to 10 million annotated words: part-of-speech tagging for Middle High German
S. Schulz and N. Ketschik. Language Resources and Evaluation, 53 (4):
837-863 (2019)
Abstract
By building a part-of-speech (POS) tagger for Middle High German, we
investigate strategies for dealing with a low-resource, diverse and non-standard
language in the domain of natural language processing. We highlight various
aspects such as the data quantity needed for training and the influence of data
quality on tagger performance. Since the lack of annotated resources poses a
problem for training a tagger, we exemplify how existing resources can be adapted
fruitfully to serve as additional training data. The resulting POS model achieves a
tagging accuracy of about 91% on a diverse test set representing the different
genres, time periods and varieties of MHG.
%0 Journal Article
%1 schulz2019million
%A Schulz, Sarah
%A Ketschik, Nora
%D 2019
%J Language Resources and Evaluation
%K Annotation Middle-High-German POS-Tagging
%N 4
%P 837-863
%T From 0 to 10 million annotated words: part-of-speech tagging for Middle High German
%U http://dblp.uni-trier.de/db/journals/lre/lre53.html#SchulzK19
%V 53
%X By building a part-of-speech (POS) tagger for Middle High German, we
investigate strategies for dealing with a low-resource, diverse and non-standard
language in the domain of natural language processing. We highlight various
aspects such as the data quantity needed for training and the influence of data
quality on tagger performance. Since the lack of annotated resources poses a
problem for training a tagger, we exemplify how existing resources can be adapted
fruitfully to serve as additional training data. The resulting POS model achieves a
tagging accuracy of about 91% on a diverse test set representing the different
genres, time periods and varieties of MHG.
@article{schulz2019million,
abstract = {By building a part-of-speech (POS) tagger for Middle High German, we
investigate strategies for dealing with a low-resource, diverse and non-standard
language in the domain of natural language processing. We highlight various
aspects such as the data quantity needed for training and the influence of data
quality on tagger performance. Since the lack of annotated resources poses a
problem for training a tagger, we exemplify how existing resources can be adapted
fruitfully to serve as additional training data. The resulting POS model achieves a
tagging accuracy of about 91% on a diverse test set representing the different
genres, time periods and varieties of MHG.},
added-at = {2020-03-23T21:09:47.000+0100},
author = {Schulz, Sarah and Ketschik, Nora},
biburl = {https://puma.ub.uni-stuttgart.de/bibtex/22be1e9a836990f38d0a186c35c817f13/nora-ketschik},
interhash = {e0b323aa797abfb88a2505020fe98e04},
intrahash = {2be1e9a836990f38d0a186c35c817f13},
journal = {Language Resources and Evaluation},
keywords = {Annotation Middle-High-German POS-Tagging},
number = 4,
pages = {837--863},
timestamp = {2020-03-23T20:09:47.000+0100},
title = {From 0 to 10 million annotated words: part-of-speech tagging for Middle High German},
url = {http://dblp.uni-trier.de/db/journals/lre/lre53.html#SchulzK19},
volume = 53,
year = 2019
}