The Icelandic Gigaword Corpus

Download latest version here.

The Icelandic Gigaword Corpus (IGC) is a tagged corpus, which means that each running word is accompanied by a morphosyntactic tag and a lemma, and each text is accompanied by bibliographic information. The corpus is intended for linguistic research and for use in language technology projects.

Some of the corpus texts are official texts (e.g. parliamentary speeches, bills and resolutions, and adjudications). The corpus also contains large text collections from news media and social media, as well as texts from journals and books. Text collection for the corpus is ongoing, with a new version published yearly.

The last published version, IGC-2024ext, is an extension to the previous version (2022) and mainly contains texts from the years 2022 and 2023. It contains roughly 162 million running words.

Version 2022 consists of about 2,429 million running words of text.

To enable the use of the corpus in language technology projects, copyright clearance has been secured for the texts used. Originally the idea was to secure permission from copyright owners to give access to the texts under Creative Commons licenses. Not all copyright holders could agree to those terms. The texts are therefore published either under a CC BY license or under a special license developed for the Tagged Icelandic Corpus (MIM). All copyright holders have agreed that their material may be used free of licensing charges. Copyright owners that did not accept the CC BY license signed a special declaration developed for the Tagged Icelandic Corpus, with necessary amendments.

The corpus is tagged by automatic means. The texts in IGC are divided into sentences and running words and then tagged and lemmatized. Tags and lemmas are not manually corrected.
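To illustrate what "divided into sentences and running words and then tagged and lemmatized" can look like in practice, here is a minimal sketch that reads sentences from a small TEI-like fragment. The element names (`s`, `w`, `pc`), the attribute names (`lemma`, `pos`) and the sample tag values are assumptions made for illustration; check the actual corpus files for the markup they use.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like fragment. Element names, attribute names and
# tag values are illustrative assumptions, not taken from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s n="1">
      <w lemma="hestur" pos="nkeng">Hesturinn</w>
      <w lemma="hlaupa" pos="sfg3en">hleypur</w>
      <pc pos="pl">.</pc>
    </s>
  </p></body></text>
</TEI>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def read_sentences(xml_text):
    """Return a list of sentences, each a list of (form, lemma, tag) tuples."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iterfind(".//tei:s", NS):
        tokens = []
        for el in s:
            # Fall back to the surface form when no lemma attribute is present.
            tokens.append((el.text, el.get("lemma", el.text), el.get("pos")))
        sentences.append(tokens)
    return sentences


for sentence in read_sentences(SAMPLE):
    for form, lemma, tag in sentence:
        print(form, lemma, tag, sep="\t")
```

Because the tags and lemmas are not manually corrected, downstream code should treat them as automatic predictions rather than gold annotations.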

Subcorpora

For the first three years IGC was distributed in two packages for licensing reasons, but since 2021 the corpus has been divided into eight (2021) and later nine (2022) subcorpora:

IGC-Adjud: Adjudications (CC BY license)
IGC-Books: Published books (MIM license)
IGC-Laws: Law, bills and resolutions (CC BY license)
IGC-Journals: Scientific/academic journals (CC BY license)
IGC-News1: News (CC BY license)
IGC-News2: News (MIM license)
IGC-Parla: Parliamentary speeches (CC BY license)
IGC-Social: Forums, blogs and tweets (CC BY license)
IGC-Wiki: Texts from the Icelandic Wikipedia (CC BY license)

In 2022 each subcorpus was published in two versions: one containing untokenized and unannotated text, and the other both tokenized and annotated.

Licensing

For the first three years IGC was distributed in two packages for licensing reasons. Approximately half of the corpus was published under CC BY 4.0 and half under a custom license. Starting with version 2021, the corpus has been divided into several subcorpora, based on text domain, each distributed under either a CC BY license or the custom license.

The main difference between the licenses is that texts published under the custom license cannot be republished. Both licenses allow building language models and publishing them, as well as use for other language technology purposes and linguistic research.
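As a small illustration of how the license split affects a project that wants to republish text, the sketch below encodes the subcorpus list from the Subcorpora section as a lookup table and selects the CC BY subcorpora. The dictionary and function names are hypothetical; the license terms themselves remain the authoritative source.

```python
# Subcorpus-to-license mapping, copied from the list in the Subcorpora section.
SUBCORPUS_LICENSE = {
    "IGC-Adjud": "CC BY",
    "IGC-Books": "MIM",
    "IGC-Laws": "CC BY",
    "IGC-Journals": "CC BY",
    "IGC-News1": "CC BY",
    "IGC-News2": "MIM",
    "IGC-Parla": "CC BY",
    "IGC-Social": "CC BY",
    "IGC-Wiki": "CC BY",
}


def republishable(subcorpora):
    """Keep only subcorpora whose texts are under CC BY (hypothetical helper)."""
    return [name for name in subcorpora if SUBCORPUS_LICENSE[name] == "CC BY"]


print(republishable(SUBCORPUS_LICENSE))
```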

Versions

The latest version of the corpus is version 2024, which is an extension to version 2022, mainly containing texts from the years 2022 and 2023*. Version 2022 is distributed as nine individual subcorpora, while version 2024 only contains new data from five of those subcorpora (IGC-Adjud, IGC-Law, IGC-News1, IGC-News2 and IGC-Parla). Further information and links to each subcorpus can be found in the CLARIN-IS repository.

IGC 2022 contains texts up to the end of the year 2021, in total 2,429 million running words. The newest versions were tagged using ABLTagger v3.0.0 and lemmatized with Nefnir. The tagset used is MIM-GOLD 2.0.

All versions**:

Version	Year of publ.	Words (M)	POS-tagger	Tokenizer	Lemmatizer	MIM-GOLD Tagset	Links
IGC 2024 (ext.)	2024	162	ABL-tagger 3.0.0	Tokenizer	Nefnir	v. 2.0	IGC 2024ext - unannotated IGC 2024ext - annotated
IGC 2022	2022	2,429	ABL-tagger 3.0.0	Tokenizer	Nefnir	v. 2.0	IGC 2022 - unannotated IGC 2022 - annotated
IGC 2021	2021	1,880	ABL-tagger 2.0.4	Tokenizer	Nefnir	v. 2.0	IGC 2021
IGC 2020	2020	1,532	ABL-tagger 0.9	Tokenizer	Nefnir	v. 1.0	IGC1 (843 million running words) MIM license IGC2 (712 million running words) CC BY 4.0
IGC 2018	2019	1,394	IceStagger	IceNLP	Nefnir	v. 1.0	IGC1 (799 million running words) MIM license IGC2 (595 million running words) CC BY 4.0
IGC 2017	2018	1,259	IceStagger	IceNLP	Nefnir	v. 1.0	IGC1 (716 million running words) MIM license IGC2 (543 million running words) CC BY 4.0

* Since newer data from 2021 was available for IGC-Parla and for the subcorpora IGC-Law1 and IGC-Law2, all texts from that year were included in version 2024 for those corpora. IGC-Law3 in version 2024 contains the Icelandic Law Corpus in its entirety and can replace IGC-Law3 from version 2022.

** There is no version with the suffix 2019 due to a change in naming conventions. The first two versions are named after the year of their most recent texts, but since then the year of publication has been used. This may have caused some confusion; any reference to IGC-2019 probably refers to IGC-2020.

Other versions:

IGC-Parl: Parliamentary speeches from 1911-2019. This is a subset of IGC 20.05, but contains additional meta-data on parliamentarians, political parties and more. IGC-Parl (219 million tokens) is published under CC BY 4.0.

Related Datasets:

IGC: Evaluation set 20.09 is manually curated to evaluate the accuracy of PoS-tagging in nine different subcorpora of the IGC. IGC-evaluation set 20.09 contains 101,261 tokens and is published under the custom MIM license.

MIM-GOLD 20.05 is a gold standard for PoS-tagging Icelandic texts. It contains approximately 1 million running words with manually annotated PoS-tags. MIM-GOLD 20.05 uses a tagset revised in 2019-2020. Train/test splits are also available. Previous versions of MIM-GOLD, 0.9 and 1.0, are also available.

Icelandic Tagged Corpus contains approximately 25 million running words. Further information here.

Icelandic Frequency Dictionary has been used for training and testing PoS-taggers for Icelandic since such work started. Training/Testing sets are available with various revision versions of the tagset. The current one is 20.05. Versions 18.10 and 12.11 are also available.

Using the Corpus

The corpus is available for download in TEI format. If you want to combine the data from versions 2022 and 2024, be aware that there is some overlap: IGC-Parla and two of the subcorpora of IGC-Law (IGC-Law1 and IGC-Law2) contain data from the year 2021 in both version 2022 and version 2024. The third subcorpus of IGC-Law (IGC-Law3) contains the whole corpus in version 2024 and can thus completely replace IGC-Law3 from version 2022.
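One possible way to handle this overlap when combining the two versions is sketched below. It assumes each document has already been loaded into a dictionary with hypothetical `id`, `subcorpus` and `year` fields; these field names are not taken from the corpus files. The rule applied is an interpretation of the notes above: drop the 2021 texts of the overlapping subcorpora from version 2022, drop IGC-Law3 from version 2022 entirely, and keep everything from version 2024.

```python
OVERLAP_YEAR = 2021
# Subcorpora whose 2021 texts appear in both versions.
OVERLAPPING = {"IGC-Parla", "IGC-Law1", "IGC-Law2"}
# Subcorpora that version 2024 replaces completely.
REPLACED = {"IGC-Law3"}


def merge_versions(docs_2022, docs_2024):
    """Combine two versions, preferring version 2024 where they overlap."""
    kept = []
    for doc in docs_2022:
        sub = doc["subcorpus"]
        if sub in REPLACED:
            continue  # superseded by the 2024 release
        if sub in OVERLAPPING and doc["year"] == OVERLAP_YEAR:
            continue  # the 2021 texts are taken from version 2024 instead
        kept.append(doc)
    return kept + list(docs_2024)


# Tiny made-up example; the records are hypothetical.
docs_2022 = [
    {"id": "a", "subcorpus": "IGC-Parla", "year": 2020},
    {"id": "b", "subcorpus": "IGC-Parla", "year": 2021},
    {"id": "c", "subcorpus": "IGC-Law3", "year": 2015},
    {"id": "d", "subcorpus": "IGC-News1", "year": 2021},
]
docs_2024 = [
    {"id": "e", "subcorpus": "IGC-Parla", "year": 2021},
    {"id": "f", "subcorpus": "IGC-Law3", "year": 2015},
]

merged = merge_versions(docs_2022, docs_2024)
print([doc["id"] for doc in merged])  # ['a', 'd', 'e', 'f']
```

Filtering by subcorpus and year avoids comparing document contents, but it relies on the year metadata being reliable for every text.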

The data is also available on Hugging Face in JSONL format, which is convenient for tasks such as training large language models (as of now, version 2024 is not available there).
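JSONL stores one JSON object per line, so files can be streamed without loading everything into memory. The sketch below reads records from a file-like object and counts whitespace-separated words; the `text` field name and the sample lines are assumptions, so check the dataset's documentation for the real schema.

```python
import io
import json


def iter_jsonl(handle):
    """Yield one parsed JSON object per non-empty line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for a downloaded .jsonl file (made-up content).
sample = io.StringIO(
    '{"text": "Halló heimur"}\n'
    '{"text": "Góðan daginn öll"}\n'
)

total_words = sum(len(record["text"].split()) for record in iter_jsonl(sample))
print(total_words)  # 5
```

For a real file, `open(path, encoding="utf-8")` can be passed in place of the `io.StringIO` object.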

Version 2022 is also accessible in a KWIC environment where the tags (linguistic annotation) can be used to define the search more accurately. The search gives a KWIC index and information about the source of each text example. The search interface is powered by the Swedish search interface Korp.

Word frequency information for version 2022 is available on a special website. It allows creating frequency lists based on various criteria.

N-gram frequencies for n-grams of up to three words are also available in an n-gram viewer.
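For readers who want comparable counts from their own copy of the data, the following is a minimal sketch of counting 1-, 2- and 3-grams over tokenized sentences. It is not the implementation behind the frequency website or the n-gram viewer, and the example sentences are made up.

```python
from collections import Counter


def ngram_counts(sentences, max_n=3):
    """Count n-grams of length 1..max_n without crossing sentence boundaries."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


sentences = [["ég", "á", "hest"], ["ég", "á", "bók"]]
counts = ngram_counts(sentences)
print(counts[2].most_common(1))  # [(('ég', 'á'), 2)]
```

Counting within sentences only is a design choice; counting across sentence boundaries would add n-grams that do not correspond to running text.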

People

The following people have worked on the corpus:

Eiríkur Rögnvaldsson, project management
Sigrún Helgadóttir, project management and licensing
Steinþór Steingrímsson, project management, licensing, compilation and software development
Starkaður Barkarson, project management, compilation and software development
Gunnar Thor Örnólfsson, software development
Kristján Rúnarsson, software development
Hildur Hafsteinsdóttir, data collection and licensing
Þórdís Dröfn Andrésdóttir, data collection
Finnur Ágúst Ingimundarson, data collection
Árni Davíð Magnússon, data collection

Cite

If you use the Icelandic Gigaword Corpus in your published research, please cite these papers:

Barkarson, Starkaður, Steinþór Steingrímsson and Hildur Hafsteinsdóttir. 2022. Evolving Large Text Corpora: Four Versions of the Icelandic Gigaword Corpus. Proceedings of the Language Resources and Evaluation Conference, pp. 2371-2381. Marseille, France.

Steingrímsson, Steinþór, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson and Jón Guðnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus. Proceedings of LREC 2018, pp. 4361-4366. Miyazaki, Japan.

If you use the word frequency website or the n-gram viewer, please consider also citing this paper:

Steingrímsson, Steinþór, Starkaður Barkarson and Gunnar Thor Örnólfsson. 2020. Facilitating Corpus Usage: Making Icelandic Corpora More Accessible for Researchers and Language Users. Proceedings of the 12th Language Resources and Evaluation Conference, pp. 3399-3405. Marseille, France.

Cooperation and Financing

Initial work on the corpus, from 2015 to 2017, was carried out at the Árni Magnússon Institute for Icelandic Studies and was funded mostly by the Infrastructure Fund (no. 151110-0031, project manager Eiríkur Rögnvaldsson) and the Contribution grants fund (Mótframlagssjóður) at the University of Iceland. Further work has been funded by the Ministry of Education and Culture and the Icelandic Language Technology Programme 2019-2023.

Publishers of media content and book publishers have cooperated in collecting texts, as has the company Creditinfo, which assisted in retrieving texts from radio and television as well as from some web media and printed media.