The Icelandic Gigaword Corpus

  • Steinþór Steingrímsson, The Árni Magnússon Institute for Icelandic Studies
Keywords: corpora, Icelandic, part-of-speech tagging, lemmatization

Abstract

In May 2018, a new text corpus, the Icelandic Gigaword Corpus, was launched. The first version of the corpus contains over 1.2 billion running words, PoS-tagged and lemmatized. Texts will be collected continually, and a new version of the corpus will be published every year. Although the corpus is tailored for use in language technology and linguistic research, it can also be very useful for students of linguistics. The corpus is accessible in several ways: it can be searched through a graphical search interface powered by Korp, n-grams can be compared in an n-gram viewer, and the corpus is available for download under permissive licenses.

Published: 2019-08-15
Section: Language News