Data Sets Used in Sporleder and Lapata (2006, 2004)

This page lists the data sets we used in our experiments on paragraph boundary insertion. We carried out experiments for three languages (English, Greek, and German) and three domains (fiction, news, and parliamentary proceedings). We provide detailed information on these data sets so that they can serve as benchmark sets, enabling other researchers working on this task to compare their results with those reported in Sporleder and Lapata (2006, 2004). Note that some of the corpora we used require a licence. For more information about the data, contact Mirella Lapata (mlap@inf.ed.ac.uk) or Caroline Sporleder (csporled@coli.uni-sb.de).


English


English Fiction (BNC, licence required)

The following files were used:

English News (WSJ part of the Penn Treebank, licence required)

The following sections were used:

English Parliamentary Proceedings (BNC, Hansard section, licence required)

The following files were used:

Greek


Greek Fiction (ECI, licence required)

The following files were used:

Greek News

Financial news from the weekly newspaper Eleftherotypia. For more information, send an email to Mirella Lapata (mlap@inf.ed.ac.uk).



Greek Parliamentary Proceedings (Greek part of the Europarl Corpus)

The following files were used:

German


German Fiction (Project Gutenberg)

EText numbers:

German News (ECI, Frankfurter Rundschau part, licence required)

The following files were used:

German Parliamentary Proceedings (German part of the Europarl Corpus)

The following files were used:



References: