This is the README file accompanying release 1.0 of the training data for the PropBank/NomBank version of SemEval-2010 Task 10.

This document was created on 11 January 2010 by Josef Ruppenhofer.
It was last modified on 21 January 2010 by Josef Ruppenhofer.

-------------------
Table of Contents

1. Training data contents
2. Formats
3. Further info
4. Citing this dataset
-------------------

= 1. What the training data release contains =

The text that serves as training data is taken from Arthur Conan Doyle's "The Adventure of Wisteria Lodge". Out of this lengthy, two-part story we annotated the second part, titled "The Tiger of San Pedro". This text is not subject to copyright and was taken directly from the web. In what follows we sometimes refer to the training data by the shorthand "Tiger" or "Tiger annotations".

The PropBank format of the data was automatically derived by running the conversion code contained in the archive semeval10-v7.zip (in the software subdirectory) on the Salsa/Tiger XML version of the FrameNet data. The tiger subdirectory contains both the PropBank format (Tiger.PBformat) and the FrameNet file it is based on (TigerOfSanPedro.withHeads.xml). The mappings used by the code reside in the subdirectory mappingsToFrameNet. They are based on SemLink mappings and were mainly worked out by Russell Lee-Goldman, with some input from Josef Ruppenhofer.

The underlying FrameNet annotation was carried out on top of a constituency parse tree generated by the Shalmaneser tool, which internally calls the Collins parser. We accepted the automatic parses and performed no corrections. The tool used to carry out the annotation is Salto. Shalmaneser and Salto are available from this site:

http://www.coli.uni-saarland.de/projects/salsa/page.php?id=software

Salto is described in this paper by Erk et al.:

http://www.coli.uni-saarland.de/projects/salsa/papers/lrec06-tool.pdf

The annotations were carried out by two experienced FrameNet annotators (Josef Ruppenhofer, Russell Lee-Goldman) as follows. A first pass of both frame-semantic and coreference annotation was carried out by one annotator. This first pass was then checked by the second annotator, and all divergences were adjudicated by both annotators. In the final step, both annotators jointly performed the null-instantiation resolution.

The frame inventory used for the training data is that of FrameNet release 1.4 alpha, rather than that of the last official release, FN r1.3. The annotation schemes for the frame-semantic, coreference and NI-resolution annotations are documented in the file annotation_guidelines.pdf.

= 2. Formats =

* The Tiger training data from Conan Doyle is available in a CoNLL-inspired format.

* The 9 columns we use are ordered in the following way:
  sentence id, token id, word, lemma, pos, headless syntax, syntax with heads, local roles, non-local roles

* The following example displays sentence 9 of the text.

9  1   "        "        PUNC``   (S(SBAR(WHNP  (S:10(SBAR:2(WHNP:2  _  _
9  2   What     What     WP       *)            *)                   _  _
9  3   's       be       VBZ      (VP           (VP:3                _  _
9  4   the      the      DT       (NPB          (NPB:5               _  _
9  5   matter   matter   NN       *             *                    _  _
9  6   ,        ,        PUNC,    *)))          *)))                 _  _
9  7   Walters  Walters  NNP      (NPB          (NPB:7               coref.01{A0_OVE=(s9_7)}  coref.01{A1_OVE=(s8_18)}
9  8   ?        ?        PUNC.    *             *                    _  _
9  9   "        "        PUNC''   *)            *)                   _  _
9  10  asked    ask      VBD      (VP           (VP:10               ask.01{A0_OVE=(s9_11);A1_OVE=(s9_1,s9_2,s9_3,s9_4,s9_5,s9_6,s9_7,s9_8,s9_9);A2_DNI=(s9_7)}  ask.01{}
9  11  Baynes   Baynes   NNP      *             *                    coref.01{A0_OVE=(s9_11)}  coref.01{A1_OVE=(s6_10)}
9  12  sharply  sharply  RB       (ADVP         (ADVP:12             _  _
9  13  .        .        PUNC.    *)))          *)))                 _  _

* Note the following:

** In the syntax-with-heads column, the head of each non-terminal is added to the phrase type label, with a colon as separator. For instance, the head of the S(entence) that opens on token 1 is token 10, the main verb "asked". Similarly, the head of the noun phrase that begins with token 4 is token 5.

** Coreference annotation is provided as an "honorary" frameset coref. As with regular framesets, the local arguments appear in the 8th column and the non-local ones in the 9th. The line for token 7, "Walters", shows that there is an earlier coreferent mention of this referent in sentence 8, namely token 18 there. Since the antecedent is in a different sentence, it is captured in the non-local column.

** Arguments are represented as the set of terminals they cover. For instance, argument A0 of the ask.01 frameset on token 10 covers terminal 11, "Baynes". Argument A1 of the same predicate covers terminals 1 through 9, that is, the whole stretch of direct speech including the quote symbols.

** Arguments carry either the marking OVE for "overt" or DNI for "definite null instantiation". Where appropriate, the terminals of an antecedent that explicitly refers to the correct filler of the role are given as the resolution of a DNI argument. For instance, argument A2 of "asked", the addressee of the question, is not expressed as a syntactic argument of "ask" but is understood to be Walters, the person addressed in the direct quote. This is captured through the notation A2_DNI=(s9_7).
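To make the column layout and the role notation concrete, the following is a minimal Python sketch of a reader for this format. It is our own illustration, not part of the release: all function and column names are ours, and we assume whitespace-separated columns as in the example sentence above.

  # Minimal reader sketch for the 9-column format described above.
  # All names here are illustrative; the release does not ship a reader.

  COLUMNS = ("sent_id", "tok_id", "word", "lemma", "pos",
             "syntax", "syntax_heads", "local_roles", "nonlocal_roles")

  def parse_line(line):
      """Split one data line into a dict keyed by the 9 column names
      (assumes whitespace-separated columns)."""
      return dict(zip(COLUMNS, line.split()))

  def parse_roles(field):
      """Turn a role column such as 'coref.01{A1_OVE=(s8_18)}' into
      ('coref.01', [('A1', 'OVE', ['s8_18'])]); '_' means no annotation."""
      if field == "_":
          return None
      frameset, _, body = field.partition("{")
      args = []
      for chunk in filter(None, body.rstrip("}").split(";")):
          lhs, _, terms = chunk.partition("=")
          role, _, mark = lhs.rpartition("_")        # e.g. 'A2_DNI'
          term_ids = [t for t in terms.strip("()").split(",") if t]
          args.append((role, mark, term_ids))        # mark is OVE or DNI
      return frameset, args

  def opened_phrases(syntax_heads):
      """List the (label, head token) pairs of phrases opening at a token,
      e.g. '(S:10(SBAR:2(WHNP:2' -> [('S', 10), ('SBAR', 2), ('WHNP', 2)]."""
      pairs = []
      for part in syntax_heads.split("(")[1:]:
          if ":" in part:
              label, _, head = part.rstrip("*)").rpartition(":")
              pairs.append((label, int(head)))
      return pairs

  # Token 10 of the example sentence:
  row = parse_line(
      "9 10 asked ask VBD (VP (VP:10 "
      "ask.01{A0_OVE=(s9_11);A1_OVE=(s9_1,s9_2,s9_3,s9_4,s9_5,s9_6,"
      "s9_7,s9_8,s9_9);A2_DNI=(s9_7)} ask.01{}"
  )
  print(opened_phrases(row["syntax_heads"]))   # [('VP', 10)]
  print(parse_roles(row["local_roles"]))       # ('ask.01', [('A0', 'OVE', ...), ...])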
= 3. Further info =

* PropBank can be downloaded here:
  http://verbs.colorado.edu/verb-index/

* If desired, additional training data can be generated by running the conversion software (in the software directory) on the FrameNet 1.4 alpha release data that is included in the FrameNet training data of our task (available from the SemEval web site).

* The evaluation script for the full task and the NI resolution task is NOT included here. At the time of this release, the script is still being tested. We will announce its availability on the Google group that we created for task participants:
  http://groups.google.com/group/semeval2010-task10?pli=1

  We also maintain a page on Task 10 at Saarland University:
  http://www.coli.uni-saarland.de/projects/semeval2010_FG/

* If you find any errors or find files to be missing, please get in touch via the Google group or by writing directly to {josefr}_A@T_coli.uni-sb.de.

* Conan Doyle's text is British English, whereas most of the FrameNet data is American English. In working with the British data, you may need to take into account the spelling differences between the two varieties (e.g. colour versus color). Also, Doyle uses some now-obsolete spellings such as to-night for "tonight".

= 4. Citing this dataset =

If you make use of these data for purposes other than participation in the SemEval 2010 shared task "Linking Events and their Participants in Discourse", we kindly ask you to refer to the following paper:

Josef Ruppenhofer, Caroline Sporleder, Roser Morante, Collin Baker and Martha Palmer. "SemEval-2010 Task 10: Linking Events and Their Participants in Discourse". The NAACL-HLT 2009 Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-09), Boulder, Colorado, USA, June 4, 2009.