Download the latest version here.
The Icelandic Gigaword Corpus (IGC) is a tagged corpus, which means that each running word is accompanied by a morphosyntactic tag and a lemma, and each text is accompanied by bibliographic information. The corpus is intended for linguistic research and for use in language technology projects.
The current version (2022) consists of about 2,429 million running words of text. Some of the corpus texts are official texts (e.g. parliamentary speeches, bills and resolutions, and adjudications). The corpus also contains large text collections from news media and social media, as well as texts from journals and books. Text collection for the corpus is ongoing, with a new version published yearly.
To enable the use of the corpus in language technology projects, copyright clearance has been secured for the texts used. Originally the idea was to secure permission from copyright owners to give access to the texts under Creative Commons licenses, but not all copyright holders could agree to those terms. The texts are therefore published either under a CC BY license or under a special license developed for the Tagged Icelandic Corpus (MIM). All copyright holders have agreed that their material may be used free of licensing charges. Copyright owners that did not accept the CC BY license signed a special declaration developed for the Tagged Icelandic Corpus, with the necessary amendments.
The corpus is tagged by automatic means. The texts in IGC are divided into sentences and running words and then tagged and lemmatized. Tags and lemmas are not manually corrected.
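The result of this pipeline can be pictured as sentences made up of (word form, tag, lemma) triples. The following is a minimal sketch of such a structure, not the format of the distributed files: the names `Token` and `lemmas` are hypothetical, and the tag strings (`PRON`, `VERB`, ...) are coarse stand-ins, not values from the MIM-GOLD tagset.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    form: str   # running word as it appears in the text
    tag: str    # morphosyntactic tag (stand-in value, not a real MIM-GOLD tag)
    lemma: str  # dictionary form assigned by the lemmatizer


Sentence = List[Token]


def lemmas(sentence: Sentence) -> List[str]:
    """Return the lemma of every token in the sentence, in order."""
    return [token.lemma for token in sentence]


# One tokenized, tagged and lemmatized sentence ("Hann las bókina.").
example: Sentence = [
    Token("Hann", "PRON", "hann"),
    Token("las", "VERB", "lesa"),
    Token("bókina", "NOUN", "bók"),
    Token(".", "PUNCT", "."),
]
```

Because the annotation is not manually corrected, code that consumes such triples should treat tags and lemmas as automatic predictions that may contain errors.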
For the first three years, IGC was distributed in two packages for licensing reasons. Approximately half of the corpus was published under CC BY 4.0 and half under a custom license. Starting with the 2021 version, the corpus has been divided into subcorpora based on text domain, eight in 2021 and nine in 2022, each of which is distributed under either a CC BY license or the custom license.
The main difference between the licenses is that texts published under the custom license cannot be republished. Both licenses allow building language models and publishing them, as well as use for other language technology purposes and linguistic research.
The latest version of the corpus is version 2022. It is distributed as nine individual subcorpora. Further information and links to each subcorpus can be found in the CLARIN-IS repository.
IGC 2022 contains texts up to the end of 2021, in total 2,429 million running words. It was tagged using ABLTagger v3.0.0 and lemmatized with Nefnir. The tagset used is MIM-GOLD 2.0.
All versions*:
Version | Year of publ. | Words (M) | POS-tagger | Tokenizer | Lemmatizer | MIM-GOLD Tagset | Links |
---|---|---|---|---|---|---|---|
IGC 2022 | 2022 | 2,429 | ABL-tagger 3.0.0 | Tokenizer | Nefnir | v. 2.0 | IGC 2022 - unannotated; IGC 2022 - annotated |
IGC 2021 | 2021 | 1,880 | ABL-tagger 2.0.4 | Tokenizer | Nefnir | v. 2.0 | IGC 2021 |
IGC 2020 | 2020 | 1,532 | ABL-tagger 0.9 | Tokenizer | Nefnir | v. 1.0 | IGC1 (843 million running words), MIM license; IGC2 (712 million running words), CC BY 4.0 |
IGC 2018 | 2019 | 1,394 | IceStagger | IceNLP | Nefnir | v. 1.0 | IGC1 (799 million running words), MIM license; IGC2 (595 million running words), CC BY 4.0 |
IGC 2017 | 2018 | 1,259 | IceStagger | IceNLP | Nefnir | v. 1.0 | IGC1 (716 million running words), MIM license; IGC2 (543 million running words), CC BY 4.0 |
* There is no version with the suffix 2019 due to a change in naming conventions. The names of the first two versions refer to the year of the most recent texts, but since then the year of publication has been used. This may have caused some confusion: any reference to IGC-2019 probably refers to IGC-2020.
Other versions:
IGC-Parl: Parliamentary speeches from 1911-2019. This is a subset of IGC 20.05, but contains additional meta-data on parliamentarians, political parties and more. IGC-Parl (219 million tokens) is published under CC BY 4.0.
Related Datasets:
IGC: Evaluation set 20.09 is manually curated to evaluate the accuracy of PoS-tagging in nine different subcorpora of the IGC. IGC-evaluation set 20.09 contains 101,261 tokens and is published under the custom MIM license.
MIM-GOLD 20.05 is a gold standard for PoS-tagging Icelandic texts. It contains approximately 1 million running words with manually annotated PoS-tags. MIM-GOLD 20.05 uses a tagset revised in 2019-2020. Train/test splits are also available. Previous versions of MIM-GOLD, 0.9 and 1.0, are also available.
Icelandic Tagged Corpus contains approximately 25 million running words. Further information here.
Icelandic Frequency Dictionary has been used for training and testing PoS-taggers for Icelandic since such work started. Training/Testing sets are available with various revision versions of the tagset. The current one is 20.05. Versions 18.10 and 12.11 are also available.
The corpus is available for download.
It is also accessible in a KWIC environment where the tags (linguistic annotation) can be used to define the search more accurately. The search gives a KWIC index and information about the source of each text example. The search interface is powered by the Swedish search interface Korp.
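To illustrate what a keyword-in-context (KWIC) search over annotated text does, here is a minimal sketch in Python. It is not the Korp interface or its API: the function `kwic`, its parameters, and the coarse tag values are hypothetical and chosen only for illustration. Each token is a (form, tag, lemma) triple, the search matches on the lemma, and an optional tag prefix narrows the hits.

```python
from typing import List, Tuple

AnnotatedToken = Tuple[str, str, str]  # (form, tag, lemma); tags are stand-in values


def kwic(
    tokens: List[AnnotatedToken],
    lemma: str,
    tag_prefix: str = "",
    window: int = 2,
) -> List[Tuple[str, str, str]]:
    """Return (left context, matched form, right context) for every hit.

    A token is a hit when its lemma equals `lemma` and its tag starts
    with `tag_prefix` (the empty prefix matches every tag).
    """
    hits = []
    for i, (form, tag, lem) in enumerate(tokens):
        if lem == lemma and tag.startswith(tag_prefix):
            left = " ".join(f for f, _, _ in tokens[max(0, i - window):i])
            right = " ".join(f for f, _, _ in tokens[i + 1:i + 1 + window])
            hits.append((left, form, right))
    return hits


sample = [
    ("Hann", "PRON", "hann"),
    ("las", "VERB", "lesa"),
    ("bókina", "NOUN", "bók"),
    ("og", "CONJ", "og"),
    ("hún", "PRON", "hún"),
    ("las", "VERB", "lesa"),
    ("blaðið", "NOUN", "blað"),
]
```

With this sample, `kwic(sample, "lesa", window=1)` returns one hit per occurrence of the verb, each with one word of context on either side, while adding `tag_prefix="NOUN"` returns no hits because neither occurrence is tagged as a noun.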
Word frequency information is available on a special website. It allows creating frequency lists based on various criteria.
N-gram frequency of up to 3-grams is also available in an n-gram viewer mode.
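Frequency lists of this kind can be derived by counting token sequences. The sketch below counts all n-grams up to length 3 in a token list; it is only an illustration of the idea, not the code behind the website or the n-gram viewer, and the function name `ngram_counts` is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def ngram_counts(tokens: List[str], max_n: int = 3) -> Dict[int, Counter]:
    """Count every n-gram of length 1..max_n in the token list.

    The result maps n to a Counter whose keys are tuples of n tokens.
    """
    counts: Dict[int, Counter] = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of width n across the tokens.
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


counts = ngram_counts("a b a b c".split())
bigram_ab = counts[2][("a", "b")]  # number of times "a b" occurs
```

For a corpus of this size, counts would in practice be accumulated per file or per subcorpus and merged, since a single in-memory `Counter` over billions of tokens is unlikely to be practical.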
The following people have worked on the corpus:
Eiríkur Rögnvaldsson, project management
Sigrún Helgadóttir, project management and licensing
Steinþór Steingrímsson, project management, licensing, compilation and software development
Starkaður Barkarson, project management, compilation and software development
Gunnar Thor Örnólfsson, software development
Kristján Rúnarsson, software development
Hildur Hafsteinsdóttir, data collection and licensing
Þórdís Dröfn Andrésdóttir, data collection
Finnur Ágúst Ingimundarson, data collection
Árni Davíð Magnússon, data collection
If you use the Icelandic Gigaword Corpus in your published research, please cite this paper:
If you use the word frequency website or n-gram viewer, please consider also citing this paper:
Initial work on the corpus, from 2015 to 2017, was carried out at the Árni Magnússon Institute for Icelandic Studies and was funded mostly by the Infrastructure Fund (no. 151110-0031, project manager Eiríkur Rögnvaldsson) and the Contribution grants fund (Mótframlagssjóður) at the University of Iceland. Further work has been funded by the Ministry of Education and Culture and the Icelandic Language Technology Programme 2019-2023.
Publishers of media content and book publishers have cooperated in collecting texts, as well as the company Creditinfo which gave assistance in retrieving texts from radio and television and some web media and printed media.