Georgian National Corpus

Institute(s): Goethe-Universität Frankfurt

Duration: until 2019

CEDIFOR project participants

Prof. Dr. Jost Gippert (Institute for Empirical Linguistics, Goethe University Frankfurt)
Prof. Dr. Manana Tandaschwili (Institute for Empirical Linguistics, Goethe University Frankfurt)

Brief description

The project “Georgian National Corpus” aims to create a comprehensive annotated corpus of the Georgian language that documents the language in a balanced way across its diachronic and synchronic diversity and makes it available for scholarly research in a range of disciplines (linguistics, literature, history, social sciences, political sciences, etc.). By now, six subcorpora with about 200 million tokens in total (fully lemmatized and morphologically annotated) are available on a dedicated server with various search functions:

GNC Old Georgian (approx. 4.5 million tokens)

GNC Middle Georgian (approx. 1.2 million tokens)

GNC Modern Georgian (approx. 600,000 tokens)

GRC: Georgian Reference Corpus (approx. 183 million tokens)

GDC: Georgian Dialect Corpus (approx. 1.7 million tokens)

SSGG: Audiovisual Text Materials (approx. 150,000 tokens)
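
As a rough cross-check, the approximate sizes listed above can be added up. The following is a minimal sketch that uses only the rounded token counts given in the list; the dictionary name `subcorpora` is illustrative and not part of any GNC tooling, and the real counts will differ from these rounded figures.

```python
# Approximate token counts, copied from the rounded figures in the list above.
# These are not exact corpus statistics.
subcorpora = {
    "GNC Old Georgian": 4_500_000,
    "GNC Middle Georgian": 1_200_000,
    "GNC Modern Georgian": 600_000,
    "GRC: Georgian Reference Corpus": 183_000_000,
    "GDC: Georgian Dialect Corpus": 1_700_000,
    "SSGG: Audiovisual Text Materials": 150_000,
}

# Sum of all listed subcorpora.
total = sum(subcorpora.values())

# Share of each subcorpus in the listed total.
shares = {name: count / total for name, count in subcorpora.items()}

# Print the subcorpora from largest to smallest, then the total.
for name, count in sorted(subcorpora.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<34} {count:>12,} {shares[name]:>7.1%}")
print(f"{'Total':<34} {total:>12,}")
```

The listed values add up to roughly 191 million tokens, which is consistent with the figure of about 200 million given above. The Georgian Reference Corpus accounts for the large majority of that total.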

Current work focuses mainly on the further disambiguation of annotations and on the creation of thematic corpora.


  • Gippert, Jost / Tandashvili, Manana: Structuring a diachronic corpus. The Georgian National Corpus project; in: Gippert, Jost / Gehrke, Ralf (eds.), Historical Corpora. Challenges and Perspectives (Korpuslinguistik und interdisziplinäre Perspektiven auf Sprache / Corpus Linguistics and Interdisciplinary Perspectives on Language, 5), Tübingen: Narr 2015, 305-322.