A Mega Corpus of Literature (Spanish|French|Portuguese)


Megalite is a corpus of literary texts in Spanish, French, and Portuguese for NLP tasks. The corpus consists of several million literary texts. It is split into sentences and grouped into three main genres:

  1. Narrative: set of narrative texts
  2. Poetry: set of poetry texts
  3. Play: set of theater texts

The purpose of this corpus is to serve as a training corpus for algorithms that process literary text in Spanish, French, and Portuguese.

Construction of the dataset

We have collected Spanish literary documents (poetry, narrative, stories, etc.). Each cluster is composed of documents of the same genre (see our papers for more details). The corpus is encoded in UTF-8 and clustered using a genre code: (void, =POESIA, =TEATRO).
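As a rough illustration, the cluster codes above could be mapped to genre names as follows. This is only a sketch: the name GENRE_BY_CODE and the function genre_of are hypothetical, and reading the void (empty) code as "narrative" is an assumption, not something stated on this page. How the code is actually stored in the corpus files is not shown here either.

```python
# Hypothetical mapping from the Megalite cluster codes to genre names.
# Assumption: the void (empty) code marks narrative texts.
GENRE_BY_CODE = {
    "": "narrative",      # assumed meaning of the void code
    "=POESIA": "poetry",  # "poesia" is Spanish for poetry
    "=TEATRO": "play",    # "teatro" is Spanish for theater
}


def genre_of(code: str) -> str:
    """Return the genre name for a cluster code, or 'unknown' if unrecognized."""
    return GENRE_BY_CODE.get(code.strip(), "unknown")
```

For instance, genre_of("=TEATRO") returns "play", while an unrecognized code falls back to "unknown" instead of raising an error.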

The corpus is composed of several files:

  • Lemmatized text
  • POS-tagged text
  • n-gram tables: 1-grams, 2-grams and SU4 bigrams
  • Context using embeddings

The corpus has been analyzed using Freeling 4.1 (about 15M sentences per language, though the number of sentences increases with each new version) and pretrained embeddings. Language models based on n-grams (n = 1, 2, SU4) are also available.
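To make the three kinds of n-gram tables concrete, here is a minimal sketch of how such counts could be computed from tokenized sentences. It is not the code used to build the distributed tables. It assumes whitespace tokenization and takes "SU4 bigrams" to mean ordered word pairs with at most four words skipped between them (the usual skip-bigram definition); the exact definition used by Megalite may differ. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple


def ngrams(tokens: List[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield contiguous n-grams from a token list."""
    return zip(*(tokens[i:] for i in range(n)))


def skip_bigrams(tokens: List[str], max_skip: int = 4) -> Iterator[Tuple[str, str]]:
    """Yield ordered token pairs with at most max_skip tokens between them."""
    for i, left in enumerate(tokens):
        # j - i - 1 is the number of skipped tokens, so j runs up to i + max_skip + 1.
        for j in range(i + 1, min(i + max_skip + 2, len(tokens))):
            yield (left, tokens[j])


def ngram_tables(sentences: Iterable[str]) -> Dict[str, Counter]:
    """Count 1-grams, 2-grams and skip-bigrams (assumed SU4) over sentences."""
    tables = {"1-grams": Counter(), "2-grams": Counter(), "SU4": Counter()}
    for sentence in sentences:
        tokens = sentence.split()  # assumption: simple whitespace tokenization
        tables["1-grams"].update(ngrams(tokens, 1))
        tables["2-grams"].update(ngrams(tokens, 2))
        tables["SU4"].update(skip_bigrams(tokens, max_skip=4))
    return tables
```

With max_skip=0 the skip-bigrams reduce to ordinary 2-grams, which is a quick sanity check on the window arithmetic.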

The Megalite corpus, in text/POS/n-gram formats (UTF-8 encoding, GNU/Linux end-of-line), is distributed under the LGPL license. New versions with more literary documents will be added periodically.
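Given the stated encoding and line endings, a text file from the corpus could be read along these lines. This is a sketch under assumptions: the function name read_sentences is hypothetical, and the one-sentence-per-line layout is assumed from the statement that the corpus is split into sentences, not confirmed by this page.

```python
from typing import Iterator


def read_sentences(path: str) -> Iterator[str]:
    """Yield non-empty lines from a UTF-8 text file with LF line endings."""
    # newline="\n" makes Python split lines on LF only, matching the
    # GNU/Linux end-of-line convention instead of universal newlines.
    with open(path, encoding="utf-8", newline="\n") as handle:
        for line in handle:
            sentence = line.rstrip("\n")
            if sentence:  # skip blank lines
                yield sentence
```

Because the function is a generator, a file with millions of sentences can be streamed without loading it fully into memory.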

Download the Megalite corpus


Spanish Megalite corpus (5075 docs, 1336 authors, 15M sentences, 212M words)

French Megalite corpus (x docs, x authors, xM sentences, xM words; lemmatized text; tagged with Freeling 4.1; n-gram tables)

Portuguese Megalite corpus (x docs, x authors, xM sentences, xM words; lemmatized text; tagged with Freeling 4.1; n-gram tables)


Would you like to collaborate with the Megalite project? Have you found mistakes in the corpus? Please contact us.

How to cite this corpus?

  • Moreno & Torres-Moreno, Megalite: A New Spanish Literature Corpus for NLP Tasks, CS & IT - CSCP 2021
  • Moreno & Torres-Moreno, MegaLite-2: An Extended Bilingual Comparative Literary Corpus, Intelligent Computing, pp. 1014–1029, 2021
    Contact : Juan-Manuel Torres & Luis-Gil Moreno Jiménez
    http://lia.univ-avignon.fr / Université d'Avignon, France
    juan *-* manuel *dot* torres *at* univ-avignon *dot* fr | luis-gil *dot* moreno-jimenez *at* univ-avignon *dot* fr
    Updated 2022.April.26