The Icelandic Gigaword Corpus

Download latest version here.

The Icelandic Gigaword Corpus (IGC) is a tagged corpus, which means that each running word is accompanied by a morphosyntactic tag and a lemma, and each text is accompanied by bibliographic information. The corpus is intended for linguistic research and for use in language technology projects.

The current version (2022) consists of about 2,429 million running words of text. Some of the texts are official documents (e.g. parliamentary speeches, bills and resolutions, and adjudications). The corpus also contains large text collections from news media and social media, as well as texts from journals and books. Text collection for the corpus is ongoing, with a new version published yearly.

To enable the use of the corpus in language technology projects, copyright clearance has been secured for the texts used. The original idea was to secure permission from copyright owners to give access to the texts under Creative Commons licenses, but not all copyright holders could agree to those terms. The texts are therefore published either under a CC BY license or under a special license developed for the Tagged Icelandic Corpus (MIM). All copyright holders have agreed that their material may be used free of licensing charges. Copyright owners that did not accept the CC BY license signed a special declaration developed for the Tagged Icelandic Corpus, with the necessary amendments.

The corpus is tagged by automatic means. The texts in IGC are divided into sentences and running words and then tagged and lemmatized. Tags and lemmas are not manually corrected.

Subcorpora

For the first three years, IGC was distributed in two packages for licensing reasons, but from 2021 the corpus has been divided into eight (2021) and later nine (2022) subcorpora:

  • IGC-Adjud: Adjudications (CC BY license)
  • IGC-Books: Published books (MIM license)
  • IGC-Laws: Laws, bills and resolutions (CC BY license)
  • IGC-Journals: Scientific/academic journals (CC BY license)
  • IGC-News1: News (CC BY license)
  • IGC-News2: News (MIM license)
  • IGC-Parla: Parliamentary speeches (CC BY license)
  • IGC-Social: Forums, blogs and tweets (CC BY license)
  • IGC-Wiki: Texts from the Icelandic Wikipedia (CC BY license)

In 2022 each subcorpus was published in two versions: one contains untokenized and unannotated text, while the other is both tokenized and annotated.

Licensing

For the first three years, IGC was distributed in two packages for licensing reasons. Approximately half of the corpus was published under CC BY 4.0 and half under a custom license. Starting with the 2021 version, the corpus has been divided into several subcorpora, based on text domain, each distributed under either a CC BY license or the custom license.

The main difference between the licenses is that texts published under the custom license cannot be republished. Both licenses allow building and publishing language models, as well as use for other language technology purposes and for linguistic research.
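
As a rough illustration of how the license split might be handled in a processing pipeline, the sketch below records the license of each IGC 2022 subcorpus as given in the subcorpus list and filters on it. The function names are hypothetical and not part of any IGC tooling, and the yes/no republication check is a simplification of the difference described above; the license texts themselves remain authoritative.

```python
from typing import List

# License per IGC 2022 subcorpus, copied from the subcorpus list above.
SUBCORPUS_LICENSE = {
    "IGC-Adjud": "CC BY",
    "IGC-Books": "MIM",
    "IGC-Laws": "CC BY",
    "IGC-Journals": "CC BY",
    "IGC-News1": "CC BY",
    "IGC-News2": "MIM",
    "IGC-Parla": "CC BY",
    "IGC-Social": "CC BY",
    "IGC-Wiki": "CC BY",
}


def may_republish(subcorpus: str) -> bool:
    """Simplified check: only CC BY texts may be republished.

    Hypothetical helper; raises KeyError for unknown subcorpus names.
    """
    return SUBCORPUS_LICENSE[subcorpus] == "CC BY"


def republishable_subcorpora() -> List[str]:
    """Return the names of all subcorpora whose texts may be republished."""
    return sorted(name for name in SUBCORPUS_LICENSE if may_republish(name))


if __name__ == "__main__":
    for name in republishable_subcorpora():
        print(name)
```

Keeping the mapping in one place means that a project that needs to redistribute derived text can exclude the MIM-licensed subcorpora with a single filter.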

Versions

The latest version of the corpus is version 2022. It is distributed as nine individual subcorpora. Further information and links to each subcorpus can be found in the CLARIN-IS repository.

IGC 2022 contains texts up to the end of the year 2021, in total 2,429 million running words. It was tagged using ABLTagger v3.0.0 and lemmatized with Nefnir. The tagset used is MIM-GOLD 2.0.

All versions*:

Version | Year of publ. | Words (M) | POS-tagger | Tokenizer | Lemmatizer | MIM-GOLD Tagset | Links
IGC 2022 | 2022 | 2,429 | ABL-tagger 3.0.0 | Tokenizer | Nefnir | v. 2.0 | IGC 2022 - unannotated; IGC 2022 - annotated
IGC 2021 | 2021 | 1,880 | ABL-tagger 2.0.4 | Tokenizer | Nefnir | v. 2.0 | IGC 2021
IGC 2020 | 2020 | 1,532 | ABL-tagger 0.9 | Tokenizer | Nefnir | v. 1.0 | IGC1 (843 million running words), MIM license; IGC2 (712 million running words), CC BY 4.0
IGC 2018 | 2019 | 1,394 | IceStagger | IceNLP | Nefnir | v. 1.0 | IGC1 (799 million running words), MIM license; IGC2 (595 million running words), CC BY 4.0
IGC 2017 | 2018 | 1,259 | IceStagger | IceNLP | Nefnir | v. 1.0 | IGC1 (716 million running words), MIM license; IGC2 (543 million running words), CC BY 4.0

* There is no version with the suffix 2019 because of a change in naming conventions. The names of the first two versions refer to the year of the most recent texts, but since then the year of publication has been used. This may have caused some confusion; any reference to IGC-2019 probably refers to IGC-2020.

Other versions:

IGC-Parl: Parliamentary speeches from 1911-2019. This is a subset of IGC 20.05, but contains additional meta-data on parliamentarians, political parties and more. IGC-Parl (219 million tokens) is published under CC BY 4.0.

Related Datasets:

IGC: Evaluation set 20.09 is manually curated to evaluate the accuracy of PoS-tagging in nine different subcorpora of the IGC. The evaluation set contains 101,261 tokens and is published under the custom MIM license.

MIM-GOLD 20.05 is a gold standard for PoS-tagging Icelandic texts. It contains approximately 1 million running words with manually annotated PoS-tags. MIM-GOLD 20.05 uses a tagset revised in 2019-2020. Train/test splits are also available. Previous versions of MIM-GOLD, 0.9 and 1.0, are also available.

Icelandic Tagged Corpus contains approximately 25 million running words. Further information here.

Icelandic Frequency Dictionary has been used for training and testing PoS-taggers for Icelandic since such work started. Training/Testing sets are available with various revision versions of the tagset. The current one is 20.05. Versions 18.10 and 12.11 are also available.

Using the Corpus

The corpus is available for download.

The corpus is also accessible in a KWIC (keyword in context) environment, where the tags (linguistic annotation) can be used to define the search more precisely. A search returns a KWIC index together with information about the source of each text example. The search interface is powered by the Swedish search interface Korp.
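
To make the idea of a KWIC index concrete, here is a minimal sketch that searches a tokenized, lemmatized sentence by lemma and optionally narrows the hits by tag. It is not how Korp is implemented. The `Token` type, the `kwic` function and the tag values (`PRON`, `VERB`, and so on) are hypothetical placeholders; the real corpus uses the MIM-GOLD tagset.

```python
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    form: str   # running word as it appears in the text
    lemma: str  # lemma assigned by the lemmatizer
    tag: str    # morphosyntactic tag (placeholder values in this sketch)


def kwic(
    tokens: List[Token],
    lemma: str,
    tag_prefix: Optional[str] = None,
    width: int = 2,
) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each matching token."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lemma != lemma:
            continue
        # Optionally restrict matches to tags starting with a given prefix.
        if tag_prefix is not None and not tok.tag.startswith(tag_prefix):
            continue
        left = " ".join(t.form for t in tokens[max(0, i - width):i])
        right = " ".join(t.form for t in tokens[i + 1:i + 1 + width])
        hits.append((left, tok.form, right))
    return hits


# "He read the book and she read the newspaper."
SENTENCE = [
    Token("Hann", "hann", "PRON"),
    Token("las", "lesa", "VERB"),
    Token("bókina", "bók", "NOUN"),
    Token("og", "og", "CONJ"),
    Token("hún", "hún", "PRON"),
    Token("las", "lesa", "VERB"),
    Token("blaðið", "blað", "NOUN"),
]

if __name__ == "__main__":
    for left, keyword, right in kwic(SENTENCE, "lesa"):
        print(f"{left:>12} [{keyword}] {right}")
```

Searching by lemma rather than by word form finds every inflected form of a word in one query, which matters for a heavily inflected language such as Icelandic.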

Word frequency information is available on a special website. It allows creating frequency lists based on various criteria.

N-gram frequencies for n-grams of up to three words are also available in an n-gram viewer.
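
For readers who want to compute similar statistics from a downloaded subcorpus, the following sketch counts 1-grams to 3-grams over tokenized sentences. It illustrates the general technique only and is not the code behind the n-gram viewer. The function name `ngram_counts` and the choice not to count n-grams across sentence boundaries are assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ngram_counts(
    sentences: Iterable[List[str]], max_n: int = 3
) -> "Counter[Tuple[str, ...]]":
    """Count all n-grams of length 1..max_n within each sentence."""
    counts: "Counter[Tuple[str, ...]]" = Counter()
    for sent in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of size n over the sentence.
            for i in range(len(sent) - n + 1):
                counts[tuple(sent[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    sentences = [
        ["hún", "las", "bókina"],
        ["hann", "las", "bókina"],
    ]
    for ngram, freq in ngram_counts(sentences).most_common(5):
        print(" ".join(ngram), freq)
```

The 1-gram counts produced this way are ordinary word frequencies, so the same function also covers simple frequency lists.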

People

The following people have worked on the corpus:

Eiríkur Rögnvaldsson, project management
Sigrún Helgadóttir, project management and licensing
Steinþór Steingrímsson, project management, licensing, compilation and software development
Starkaður Barkarson, project management, compilation and software development
Gunnar Thor Örnólfsson, software development
Kristján Rúnarsson, software development
Hildur Hafsteinsdóttir, data collection and licensing
Þórdís Dröfn Andrésdóttir, data collection
Finnur Ágúst Ingimundarson, data collection
Árni Davíð Magnússon, data collection

Cite

If you use the Icelandic Gigaword Corpus in your published research, please cite this paper:

Barkarson, Starkaður, Steinþór Steingrímsson and Hildur Hafsteinsdóttir. 2022. Evolving Large Text Corpora: Four Versions of the Icelandic Gigaword Corpus. Proceedings of the Language Resources and Evaluation Conference, pp. 2371-2381. Marseille, France.
Steingrímsson, Steinþór, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson and Jón Guðnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus. Proceedings of LREC 2018, pp. 4361-4366. Miyazaki, Japan.

If you use the word frequency website or the n-gram viewer, please consider also citing this paper:

Steingrímsson, Steinþór, Starkaður Barkarson and Gunnar Thor Örnólfsson. 2020. Facilitating Corpus Usage: Making Icelandic Corpora More Accessible for Researchers and Language Users. Proceedings of the 12th Language Resources and Evaluation Conference, pp. 3399-3405. Marseille, France.

Cooperation and Financing

Initial work on the corpus, from 2015 to 2017, was carried out at the Árni Magnússon Institute for Icelandic Studies and was funded mostly by the Infrastructure Fund (no. 151110-0031, project manager Eiríkur Rögnvaldsson) and the Contribution grants fund (Mótframlagssjóður) at the University of Iceland. Further work has been funded by the Ministry of Education and Culture and the Icelandic Language Technology Programme 2019-2023.

Media and book publishers have cooperated in collecting texts, as has the company Creditinfo, which assisted in retrieving texts from radio and television and from some web and print media.