The PUCES Corpus for Automatic Text Summarization


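"Informatique et puces. Et si l'ordinateur pouvait fonctionner un jour, sans électricité ou presque?... Des piqures de puces ont été diagnostiquées sur plus d'un tiers des 155 hommes de la compagnie IV de l'école de recrues d'infanterie d'exploration et de transmission 213"
(English: "Computing and chips. What if the computer could one day run without electricity, or almost?... Flea bites were diagnosed on more than a third of the 155 men of Company IV of the reconnaissance and signals infantry recruit school 213")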

The French corpus PUCES contains a source document (a mix of Le Monde 12/05/1999 and a Confoederatio Helvetica communiqué 1997/Sep/10), a set of 291 extracts (the summaries) created by human annotators and 16 systems, and a set of 186 human abstracts (the references). The source document has 30 sentences that belong to 2 topics: a new electronic microchip for computers and a flea infestation in a military company. The topic mixture comes from the polysemy of the French word "puces": one sense is microchips, the other is fleas.

The reference corpus (the abstracts) was written manually by many people: our undergraduate and graduate students in Computer Science or Computer Engineering, in Québec (Canada) and France (Universités du Québec UQAC and UQAM, Polytechnique Montréal and Université d'Avignon).

The same protocol and conditions were used throughout to produce the extracts and abstracts: 10 minutes to read the document and select 8 relevant sentences (the extract), then 10 minutes to write the corresponding abstract.

This corpus is well suited to evaluating automatic text summarization systems with measures that work with or without references, such as SummTriver, FRESA, SIMetrix or ROUGE. New versions, with more human extracts and references, will be added periodically.
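As an illustration of reference-based evaluation, the sketch below computes a simplified clipped unigram recall between a candidate summary and a set of reference abstracts. It is only in the spirit of ROUGE-1 recall and is not the official ROUGE implementation (no stemming, no stop-word handling, whitespace tokenization). The function name `unigram_recall` and the toy strings are assumptions made for this example and are not taken from the corpus.

```python
from collections import Counter


def unigram_recall(candidate: str, references: list[str]) -> float:
    """Average clipped unigram recall of a candidate against each reference.

    Simplified sketch: lowercased whitespace tokens, no stemming.
    """
    cand_counts = Counter(candidate.lower().split())
    scores = []
    for ref in references:
        ref_counts = Counter(ref.lower().split())
        total = sum(ref_counts.values())
        if total == 0:
            continue  # skip empty references
        # Each reference token is matched at most as often as it
        # appears in the candidate (clipping).
        overlap = sum(min(n, cand_counts[w]) for w, n in ref_counts.items())
        scores.append(overlap / total)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy strings, invented for illustration only.
    print(unigram_recall("les puces electroniques", ["les puces"]))
```

In practice one would loop over the system summaries and score each against the 186 human abstracts, or use one of the tools named above.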

A "democratic" scoring of the sentences of the PUCES document (without accents) is available here.
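The page does not define how the "democratic" scoring is computed. One plausible reading, assumed here, is a simple vote count: each extract votes for the sentences it contains, and a sentence's score is the fraction of extracts that selected it. The sketch below follows that assumption; the function name `democratic_scores`, the 1-based sentence numbering, and the toy extracts are all hypothetical.

```python
from collections import Counter

NUM_SENTENCES = 30  # the source document has 30 sentences


def democratic_scores(extracts: list[list[int]],
                      num_sentences: int = NUM_SENTENCES) -> dict[int, float]:
    """Score each sentence by the fraction of extracts that selected it.

    `extracts` holds one list of 1-based sentence numbers per extract
    (an assumed representation, not the corpus file format).
    """
    votes = Counter()
    for extract in extracts:
        for idx in set(extract):  # one vote per extract per sentence
            if not 1 <= idx <= num_sentences:
                raise ValueError(f"sentence number out of range: {idx}")
            votes[idx] += 1
    n = len(extracts)
    return {i: (votes[i] / n if n else 0.0)
            for i in range(1, num_sentences + 1)}


if __name__ == "__main__":
    # Toy extracts, invented for illustration (real extracts have 8 sentences).
    toy = [[1, 2, 3], [2, 3, 4], [3, 5, 6]]
    scores = democratic_scores(toy)
    ranked = sorted(scores, key=scores.get, reverse=True)
    print(ranked[:3])
```

Sorting sentences by this score gives a consensus ranking that could serve as a reference extract, under the same assumption.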

The PUCES corpus (UTF-8 encoded) is distributed under the LGPL license.



How to cite this corpus?

Contact: Juan-Manuel Torres & Luis Adrian Cabrera Diego
http://lia.univ-avignon.fr / Université d'Avignon, France
juan *-* manuel *dot* torres *at* univ-avignon *dot* fr
adrian *dot* 9819 *at* gmail *dot* com
Updated 2020.10.10