Megalite is a corpus of literary texts in Spanish, French, and Portuguese for NLP tasks. The Megalite corpus consists of several million sentences of literary text. It is split into sentences and clustered into three main genres.
We have collected Spanish literary documents (poetry, narrative, stories, etc.). Each cluster is composed of documents of the same genre (see our papers for more details). The corpus is encoded in UTF-8 and clustered using a genre code: (void, =POESIA, =TEATRO).
The corpus is composed of several files:
Lemmatized text
POS tagged text
n-gram tables: 1-grams, 2-grams, and SU4 bigrams
Context using embeddings.
The corpus has been analyzed using Freeling 4.1 (about 15M sentences per language, although the number of sentences increases with each new version) and pretrained embeddings. Language models using n-grams (n = 1, 2, SU4) are also available.
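To make the three n-gram table types concrete, here is a minimal Python sketch that builds 1-gram, 2-gram, and SU4 tables for one tokenized sentence. It assumes that "SU4" means skip bigrams with at most four intervening tokens, as in the ROUGE-SU4 convention; the exact definition and file layout used for the distributed Megalite tables may differ. The function names ngrams and skip_bigrams are hypothetical and are not part of the corpus distribution.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens, max_gap=4):
    """Return ordered token pairs separated by at most max_gap tokens.

    Assumption: max_gap=4 corresponds to "SU4"; max_gap=0 gives plain bigrams.
    """
    pairs = []
    for i, left in enumerate(tokens):
        # The right-hand token may sit up to max_gap + 1 positions after i.
        last = min(i + max_gap + 2, len(tokens))
        for j in range(i + 1, last):
            pairs.append((left, tokens[j]))
    return pairs


# Illustrative sentence, already lowercased and split on whitespace.
sentence = "en un lugar de la mancha de cuyo nombre".split()

unigram_table = Counter(ngrams(sentence, 1))
bigram_table = Counter(ngrams(sentence, 2))
su4_table = Counter(skip_bigrams(sentence, max_gap=4))

# Relative frequencies give a very simple unigram language model.
total = sum(unigram_table.values())
unigram_model = {gram: count / total for gram, count in unigram_table.items()}
```

Counting over every sentence of a corpus file and summing the Counter objects would produce corpus-level tables of the same shape.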
The Megalite corpus, in text/POS/n-gram formats (UTF-8 encoding, GNU/Linux line endings), is distributed under the LGPL license. New versions with more literary documents will be added periodically.
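Since the files are UTF-8 with GNU/Linux (LF) line endings, they can be read by stating both explicitly. The sketch below assumes one sentence per line, which the description suggests but does not confirm; the helper name iter_sentences is hypothetical.

```python
def iter_sentences(path):
    """Yield the non-empty lines of a UTF-8 text file with LF line endings.

    Assumption: each line of the file holds one sentence.
    """
    # newline="\n" splits lines on LF only and leaves other characters untouched.
    with open(path, encoding="utf-8", newline="\n") as handle:
        for line in handle:
            sentence = line.rstrip("\n")
            if sentence.strip():
                yield sentence
```

For example, sum(1 for _ in iter_sentences(path)) counts the sentences in a file without loading the whole file into memory.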
Spanish Megalite corpus (5075 docs, 1336 authors, 15M sentences, 212M words)
French Megalite corpus (x docs, x authors, xM sentences, xM words; lemmatized text; tagged with Freeling 4.1; n-gram tables)
Portuguese Megalite corpus (x docs, x authors, xM sentences, xM words; lemmatized text; tagged with Freeling 4.1; n-gram tables)
How to cite this corpus?