The Portuguese/Spanish corpus of Multi-Sentence Fusion

The corpora are split into three directories:

src: sentence clusters in raw and tokenized formats

ref: manual compressions to be used as references for automatic ROUGE/BLEU evaluation

pos: tokenized and part-of-speech tagged sentences (tagged with the TreeTagger POS tagger)
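
To get a quick overview of a local copy, the three directories can be listed with a few lines of Python. This is only a minimal sketch: the directory names `src`, `ref` and `pos` come from the description above, while the function name `list_corpus_files`, the corpus root path and the choice to return relative paths are assumptions made for illustration. Nothing is assumed about file names or file formats inside each directory.

```python
from pathlib import Path

# Directory names are taken from the description above; everything else
# (function name, root path, relative-path output) is an assumption.
CORPUS_DIRS = ("src", "ref", "pos")


def list_corpus_files(root):
    """Map each corpus directory name to a sorted list of the files under it."""
    root = Path(root)
    listing = {}
    for name in CORPUS_DIRS:
        directory = root / name
        if not directory.is_dir():
            listing[name] = []  # directory missing: report it as empty
            continue
        # Walk the directory recursively and keep paths relative to it.
        listing[name] = sorted(
            p.relative_to(directory).as_posix()
            for p in directory.rglob("*")
            if p.is_file()
        )
    return listing


if __name__ == "__main__":
    # "." is a placeholder for wherever the corpus was unpacked.
    for name, files in list_corpus_files(".").items():
        print(f"{name}: {len(files)} files")
```

Run from the corpus root, the script prints one file count per directory, which is a simple way to check that the copy is complete before running an evaluation.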

# Construction of the dataset

We collected links from Google News in Portuguese and Spanish between July and September 2016. These links redirect to international news sites in Spanish (La Jornada, Milenio, El Economista, BBC Mundo, El Colombiano, El País, CNN en español, etc.) and in Portuguese (G1, Uol Notícias, Estadão, O Globo, etc.). Each cluster is composed of related sentences describing a specific event in one of the following areas: Science, Sports, Economy, Health, Business, Technology, Accidents/Catastrophes, General Information and other subjects (see our paper for more details).

Page updated on 10 March 2018