Computational Linguistics & Phonetics, Fachrichtung 4.7, Universität des Saarlandes
Simon Ostermann's Homepage

Data

InScript

[Data | Paper by Modi et al.]
1000 narrative texts crowdsourced via Amazon Mechanical Turk. The texts cover 10 different scenarios describing everyday situations, such as taking a bath or baking a cake, and are annotated with script information (event and participant labels) as well as coreference chains linking different mentions of the same entity within a document.

MCScript

[Data | Paper by Ostermann et al.]
A machine comprehension dataset of 2100 narrative texts and 14,000 questions crowdsourced via Amazon Mechanical Turk. The texts and questions cover 110 everyday scenarios. Approximately a third of the questions require commonsense knowledge or script knowledge to find the correct answer.

MCScript2.0

[Data | Paper by Ostermann et al.]
A machine comprehension corpus for the end-to-end evaluation of script knowledge. It contains approx. 20,000 questions on approx. 3,500 texts, crowdsourced using a new collection process that results in challenging questions. Half of the questions cannot be answered from the reading texts alone, but require the use of commonsense and, in particular, script knowledge. The task is not challenging for humans, but existing machine comprehension models fail to perform well on the data, even if they make use of a commonsense knowledge base. The data set is used for a shared task at the COIN workshop.

DeScript

[Data | Paper by Wanzare et al.]
A corpus of event sequence descriptions (ESDs) for different scenarios crowdsourced via Amazon Mechanical Turk. It has 40 scenarios with approximately 100 ESDs each. The corpus also has partial alignments of event descriptions that are semantically similar with respect to the given scenario.

All data sets can also be found on the official SFB resources page.

Code

Find me on GitHub and on our SFB-internal GitLab page!