The Icelandic Gigaword Corpus

Download latest version here.

The Icelandic Gigaword Corpus (IGC) is a tagged corpus: each running word is accompanied by a morphosyntactic tag and a lemma, and each text is accompanied by bibliographic information. The corpus is intended for linguistic research and for use in Language Technology projects.

The current version consists of about 1550 million running words of text. Part of the corpus consists of official texts (e.g. parliamentary speeches dating back to 1911, legal texts and adjudications). The corpus also contains large text collections from news media and various texts from the text collection of the Árni Magnússon Institute for Icelandic Studies. Text collection for the corpus is ongoing, with a new version published yearly.

To enable the use of the corpus in Language Technology projects, copyright clearance has been secured for the texts used. Originally the idea was to secure permission from copyright owners to give access to the texts under Creative Commons licenses. Not all copyright holders could agree to those terms. The corpus is therefore divided into two parts, IGC1 and IGC2. IGC1 contains texts that can be used under a special license developed for the Tagged Icelandic Corpus (MIM). IGC2 contains official texts and texts that can be used under a CC BY license. All copyright holders have agreed that their material may be used free of licensing charges. Copyright owners who did not accept the CC BY license signed a special declaration developed for the Tagged Icelandic Corpus, with the necessary amendments for IGC1.

The corpus is tagged by automatic means. The texts in IGC are divided into sentences and running words and then tagged and lemmatized. Tags and lemmas are not manually corrected.
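As an illustration of what "tagged and lemmatized" means in practice, the following minimal Python sketch represents each running word together with its tag and lemma and counts lemma frequencies over a few sentences. The Token class, the lemma_frequencies helper and the tag values (NOUN, VERB, PUNCT) are hypothetical placeholders chosen for this example; they are not the IGC file format or the IGC tagset.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One running word with its annotation (hypothetical layout)."""
    word: str   # the word form as it appears in the text
    tag: str    # morphosyntactic tag (placeholder values, not the IGC tagset)
    lemma: str  # dictionary form of the word


def lemma_frequencies(sentences):
    """Count how often each lemma occurs across a list of sentences."""
    return Counter(tok.lemma for sent in sentences for tok in sent)


# Two short example sentences, already split into sentences and tokens.
sentences = [
    [Token("Hesturinn", "NOUN", "hestur"),
     Token("hljóp", "VERB", "hlaupa"),
     Token(".", "PUNCT", ".")],
    [Token("Hestar", "NOUN", "hestur"),
     Token("hlaupa", "VERB", "hlaupa"),
     Token(".", "PUNCT", ".")],
]

freqs = lemma_frequencies(sentences)
```

Because different word forms share a lemma, counting by lemma groups "Hesturinn" and "Hestar" under "hestur". Since the tags and lemmas in the corpus are not manually corrected, counts derived this way should be treated as approximate.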

Subcorpora

...

Licensing

The corpus is published in two packages due to licensing reasons. Approximately half of the corpus is published under CC BY 4.0 and half under a custom license.

The main difference between the licenses is that the texts published under the custom license cannot be republished. Both licenses allow building and publishing language models, as well as use for other language technology purposes and linguistic research.

Versions

The latest version of the corpus is version 20.05. Download part 1 and part 2.

IGC 20.05 is split in two. IGC1 (843 million running words) is published under the custom MIM license and IGC2 (712 million running words) is published under CC BY 4.0. Both licenses allow usage of the data for training and publishing language models and other language technology research.

Previous versions:

IGC version 2 (2018): IGC1 (799 million running words) is published under the custom MIM license and IGC2 (595 million running words) is published under CC BY 4.0.

IGC version 1 (2017): IGC1 (710 million running words) is published under the custom MIM license and IGC2 (543 million running words) is published under CC BY 4.0.

Other versions:

IGC-Parl: Parliamentary speeches from 1911 to 2019. This is a subset of IGC 20.05, but contains additional metadata on parliamentarians, political parties and more. IGC-Parl (219 million tokens) is published under CC BY 4.0.

Related Datasets:

IGC evaluation set 20.09 is manually curated to evaluate the accuracy of PoS-tagging in nine different subcorpora of the IGC. It contains 101,261 tokens and is published under the custom MIM license.

MIM-GOLD 20.05 is a gold standard for PoS-tagging Icelandic texts. It contains approximately 1 million running words with manually annotated PoS-tags. MIM-GOLD 20.05 uses a tagset revised in 2019-2020. Train/test splits are also available. Previous versions of MIM-GOLD, 0.9 and 1.0, are also available.

Icelandic Tagged Corpus contains approximately 25 million running words. Further information here.

Icelandic Frequency Dictionary has been used for training and testing PoS-taggers for Icelandic since such work started. Training/testing sets are available for various revisions of the tagset. The current one is 20.05. Versions 18.10 and 12.11 are also available.

Using the Corpus

The corpus is available for download.

It is also accessible in a KWIC (keyword in context) environment where the tags (linguistic annotation) can be used to define the search more accurately. The search gives a KWIC index and information about the source of each text example. The search interface is powered by Korp, a Swedish corpus search tool.
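To make the idea of a tag-restricted KWIC search concrete, here is a minimal Python sketch. It is not how Korp works internally; the kwic function, its parameters and the tag labels are assumptions made for illustration only. It lists each occurrence of a keyword with a few words of context on either side, optionally keeping only hits with a given tag.

```python
def kwic(tokens, keyword, tag=None, window=2):
    """Return (left context, keyword, right context) for each hit.

    tokens  -- list of (word, tag) pairs (hypothetical in-memory layout)
    keyword -- word form to search for, matched case-insensitively
    tag     -- if given, keep only hits carrying exactly this tag
    window  -- number of context words shown on each side
    """
    hits = []
    for i, (word, word_tag) in enumerate(tokens):
        if word.lower() != keyword.lower():
            continue
        if tag is not None and word_tag != tag:
            continue
        left = " ".join(w for w, _ in tokens[max(0, i - window):i])
        right = " ".join(w for w, _ in tokens[i + 1:i + 1 + window])
        hits.append((left, word, right))
    return hits


# Placeholder tags; these are not the IGC tagset.
tokens = [
    ("they", "PRON"), ("run", "VERB"), ("fast", "ADV"), ("after", "ADP"),
    ("a", "DET"), ("long", "ADJ"), ("run", "NOUN"),
]

all_hits = kwic(tokens, "run")               # both occurrences of "run"
noun_hits = kwic(tokens, "run", tag="NOUN")  # only the noun reading
```

Restricting by tag separates the two readings of the same word form, which is the kind of refinement the linguistic annotation makes possible.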

Word frequency information is available on a special website. It allows creating frequency lists based on various criteria.

Frequencies of n-grams up to 3-grams are also available in an n-gram viewer.
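For readers unfamiliar with n-gram frequencies, the following short Python sketch counts 1-grams, 2-grams and 3-grams in a token sequence. The ngram_counts function is a hypothetical helper written for this example and makes no claim about how the n-gram viewer computes its figures.

```python
from collections import Counter


def ngram_counts(tokens, max_n=3):
    """Count all n-grams of length 1..max_n in a list of tokens.

    Returns a dict mapping n to a Counter keyed by n-gram tuples.
    """
    counts = {}
    for n in range(1, max_n + 1):
        counts[n] = Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts


tokens = "a b a b c".split()
counts = ngram_counts(tokens)

# The two most frequent 2-grams with their counts.
top_bigrams = counts[2].most_common(2)
```

In a real corpus, n-grams would normally be counted within sentence boundaries; this sketch treats the input as a single sequence for brevity.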

People

The following people have worked on the corpus:

Eiríkur Rögnvaldsson, project management
Sigrún Helgadóttir, project management and licensing
Steinþór Steingrímsson, project management, licensing, compilation and software development
Starkaður Barkarson, project management, compilation and software development
Gunnar Thor Örnólfsson, software development
Kristján Rúnarsson, software development
Hildur Hafsteinsdóttir, data collection and licensing
Þórdís Dröfn Andrésdóttir, data collection
Finnur Ingimarsson, data collection

Cite

If you use the Icelandic Gigaword Corpus in your published research, please cite this paper:

    @inproceedings{steingrimsson-etal-2018-risamalheild,
    title = "{R}isam{\'a}lheild: A Very Large {I}celandic Text Corpus",
    author = {Steingr{\'\i}msson, Stein{\th}{\'o}r and
    Helgad{\'o}ttir, Sigr{\'u}n and
    R{\"o}gnvaldsson, Eir{\'\i}kur and
    Barkarson, Starka{\dh}ur and
    Gu{\dh}nason, J{\'o}n},
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
    year = "2018",
    address = "Miyazaki, Japan",
    }

If you use the word frequency website or the n-gram viewer, please consider also citing this paper:

    @inproceedings{steingrimsson-etal-2020-facilitating,
    title = "Facilitating Corpus Usage: Making {I}celandic Corpora More Accessible for Researchers and Language Users",
    author = {Steingr{\'\i}msson, Stein{\th}{\'o}r and
    Barkarson, Starka{\dh}ur and
    {\"O}rn{\'o}lfsson, Gunnar Thor},
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    year = "2020",
    address = "Marseille, France",
    pages = "3399--3405",
    }

Cooperation and Financing

Initial work on the corpus, from 2015 to 2017, was carried out at the Árni Magnússon Institute for Icelandic Studies and was funded mostly by the Infrastructure Fund (no. 151110-0031, project manager Eiríkur Rögnvaldsson) and the Contribution grants fund (Mótframlagssjóður) at the University of Iceland. Further work has been funded by the Ministry of Education and Culture and the Icelandic Language Technology Programme 2019-2023.

Publishers of media content and book publishers have cooperated in collecting texts, as has the company Creditinfo, which assisted in retrieving texts from radio, television, and some web and print media.