Computational Linguistics & Phonetics, Fachrichtung 4.7, Universität des Saarlandes

Software and Data

This page documents software and data referred to in our publications. See the respective papers for more details.

Script Data

We plan to distribute the data we collected in the 2010 experiment as a corpus. If you are interested in similar data, we refer you to the OMICS corpus from MIT. Presently, the following materials have been made available: [RKP Scenario Files], [RKP Data], [Stories Data].

Induction of Neural Models of Script Knowledge

The provided [data] includes both development (dev) and test data. Each of these two directories contains the different script scenarios in respective sub-directories; e.g., the script scenario for preparing coffee can be found in the sub-directory test/coffee. Please consult the readme for further information.
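Assuming a local copy of the released data, the layout described above can be traversed with a few lines of Python (the root directory name here is hypothetical; dev/test and the coffee scenario follow the description above):

```python
from pathlib import Path

def list_scenarios(root):
    """Map each split (dev/test) to its script-scenario sub-directories."""
    root = Path(root)
    return {
        split: sorted(p.name for p in (root / split).iterdir() if p.is_dir())
        for split in ("dev", "test")
        if (root / split).is_dir()
    }

# Hypothetical local copy; e.g. list_scenarios("script_data")["test"]
# would include "coffee" if the data is unpacked as described.
```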

The TACoS Corpus

The Saarbrücken Corpus of Textually Annotated Cooking Scenes (short: TACoS), described in our paper "Grounding Action Description in Videos", has its own homepage. It contains textual descriptions of kitchen-related videos, in which each sentence is temporally aligned to the video segment it describes.

Other Experimental Data

The data we used for our experiments on paraphrase extraction from standard texts contains our gold standard and URLs to the source data (see our EMNLP 2012 paper for more details). All files are *.csv files, which you can import with, e.g., Excel, Google Docs, or OpenOffice:
  • The [Source Data URLs] contain a list of links to episode summaries for the TV show House, M.D. For each episode of season 6, we used 8 different URLs. The file contains one row per episode and 9 columns (the episode number and the 8 summaries). To obtain the text behind the URLs, you can either use a URL scraper or copy the texts manually (which takes 1–2 hours for the whole season). For preprocessing, we used splitta, an openly available sentence splitter by Dan Gillick.
  • The [Gold Standard for Sentence Matching] contains a table with 2000 sentence pairs labelled with paraphrase information: sentences can be paraphrases (have the same meaning), stand in a containment or backwards_containment relation (sentence 1 contains more information than sentence 2, or vice versa), be related (the sentences overlap, but each has exclusive additional information), or be unrelated.
  • The [Gold Standard for Paraphrase Fragments] contains 5 × 120 examples for the 5 phrasal fragment extraction methods we applied to the sentential paraphrases. We used the same labelling scheme as for the sentential paraphrases (except that unrelated is labelled irrelevant, because the former label seemed misleading in the annotation context). Additionally, we provide the sentence pair from which we extracted the respective fragment pair, and indicate which system pipeline led to the extracted fragment.
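Since the files are plain CSV, they can also be processed programmatically rather than in a spreadsheet. A minimal Python sketch, assuming a header row with columns sentence_1, sentence_2, and label (these column names are hypothetical; check the actual file header):

```python
import csv
import io
from collections import Counter

def label_counts(csv_file):
    """Count how often each paraphrase label occurs in a gold-standard file."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"] for row in reader)

# A miniature stand-in for one of the gold-standard files, using the
# label inventory described above; the real files are larger.
sample = io.StringIO(
    "sentence_1,sentence_2,label\n"
    "He left the house.,He departed.,paraphrase\n"
    "She ate pasta and salad.,She ate pasta.,containment\n"
    "The dog barked.,Stocks fell sharply.,unrelated\n"
)
counts = label_counts(sample)  # e.g. counts["paraphrase"] == 1
```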

Alignment Software

For our experiments published in 2010 and 2011, we used a customized version of the Needleman-Wunsch algorithm for Multiple Sequence Alignment [*] (MSA). In contrast to the common MSA tools from bioinformatics, our implementation allows for a customizable alphabet and a dynamic scoring function (computing the similarity of two elements on the fly). It is not restricted to strings (or event descriptions) as elementary units, but can in principle be adapted to any type of object by defining a similarity function for it. The tool is written in Java and comes with a GUI that helps to inspect alignment tables and tune the similarity functions and gap scores. We have not released the code officially (yet). A beta-beta version is available upon request (with the usual drawbacks :) ).
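The tool itself is unreleased Java, but the core idea — Needleman-Wunsch with a pluggable, on-the-fly similarity function over arbitrary objects — can be sketched in a few lines of Python. This is an illustrative pairwise version under our own simplifications (a linear gap penalty, one optimal traceback), not the released multiple-sequence tool:

```python
def align(seq_a, seq_b, sim, gap=-1.0):
    """Needleman-Wunsch alignment over arbitrary objects.

    sim(a, b) computes element similarity on the fly, so the 'alphabet'
    is whatever object type sim understands. Gaps are represented as None.
    """
    n, m = len(seq_a), len(seq_b)
    # score[i][j] = best score aligning seq_a[:i] with seq_b[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(seq_a[i - 1], seq_b[j - 1]),  # match
                score[i - 1][j] + gap,                                  # gap in b
                score[i][j - 1] + gap,                                  # gap in a
            )
    # Traceback: recover one optimal alignment
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sim(seq_a[i - 1], seq_b[j - 1]):
            aligned_a.append(seq_a[i - 1]); aligned_b.append(seq_b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1]); aligned_b.append(None); i -= 1
        else:
            aligned_a.append(None); aligned_b.append(seq_b[j - 1]); j -= 1
    return aligned_a[::-1], aligned_b[::-1]

# Example: align character sequences with a simple match/mismatch similarity.
sim = lambda x, y: 1.0 if x == y else -1.0
a, b = align(list("GATTACA"), list("GCATGCU"), sim)
```

Replacing sim with, say, an embedding-based similarity over event descriptions is exactly what the customizable scoring function is for.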

[*] Good introductions can be found on the Wikipedia pages for MSA and the Needleman-Wunsch algorithm.