Crosslingual Annotation of Chinese and English Discourse Connectives

This dataset contains annotations and alignments of discourse connectives in 325 articles of the Chinese Treebank and their English translations.

The work is described in the following publications:
Frances Yung, Kevin Duh, and Yuji Matsumoto. 'Sequential Annotation and Chunking of Chinese Discourse Structure', SIGHAN 2015
Frances Yung, Kevin Duh, and Yuji Matsumoto. 'Crosslingual Annotation and Analysis of Implicit Discourse Connectives for Machine Translation', DiscoMT 2015

You need the original English Chinese Translation Treebank v.1.0 released by LDC (LDC2007T02).
Annotations are presented as mappings onto the raw texts of the original parallel corpus.
However, the raw texts have been modified to make the annotation consistent (for example, some enumeration commas are replaced by normal commas). We include a short script that modifies the raw texts from the original LDC files and injects them into the annotation files. Please refer to the README file for details.

Annotation Example

ann.zh/002

ann.en/002

Meaning of tags

Please refer to the publications for details.

Tags for discourse relations (on both Chinese and English sides)
EXPLICIT: the discourse relation is signaled by an explicit discourse connective.
IMPLICIT: the discourse relation is not signaled by a discourse connective, but a discourse connective could be inserted.
REDUNDANT: the discourse relation is not signaled by a discourse connective and it is redundant to insert one.
ALTLEX: the discourse relation is signaled by an expression other than a discourse connective.


Tags for non-discourse segmenting commas (on Chinese side only)
ADVERBIAL: the comma is used to mark an adverbial clause.
OPT_COMMA: optional comma (the comma-separated segment is not a complete discourse unit).
ATTRIBUTION: the comma is used to mark an attribution.


Attributes for all tags
id: discourse relations of the same id on the Chinese and English sides are aligned.
start/end: offset of the discourse connective/alternative expression for EXPLICIT or ALTLEX; offset of the next character after the comma for IMPLICIT/REDUNDANT.
text: actual discourse connective/alternative expression for EXPLICIT or ALTLEX; the next character after the comma for IMPLICIT/REDUNDANT.
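
To illustrate how the id, start/end, and text attributes fit together, here is a minimal Python sketch. It is not part of the released script. It assumes the annotations have already been parsed into one dictionary per tag; the record layout, the function names, the example sentences, and the end-exclusive offset convention are all assumptions made for illustration, so please check them against the actual annotation files.

```python
# Hypothetical record layout: one dict per tag, keyed by the attribute names above.
# The sentences and offsets below are made-up examples, not corpus data.

def align_by_id(zh_records, en_records):
    """Pair Chinese and English relations that share the same id."""
    en_by_id = {r["id"]: r for r in en_records}
    return [(zh, en_by_id[zh["id"]]) for zh in zh_records if zh["id"] in en_by_id]

def text_matches_offsets(record, raw_text):
    """Check that the span start..end of the raw text equals the text attribute.

    Assumes 0-based offsets with an exclusive end; adjust if the files differ.
    """
    return raw_text[record["start"]:record["end"]] == record["text"]

raw_zh = "下雨了，他走了。"
raw_en = "It rained, so he left."

# IMPLICIT: start/text point at the character right after the comma.
zh = [{"tag": "IMPLICIT", "id": "1", "start": 4, "end": 5, "text": "他"}]
# EXPLICIT: start/end/text cover the connective itself.
en = [{"tag": "EXPLICIT", "id": "1", "start": 11, "end": 13, "text": "so"}]

pairs = align_by_id(zh, en)
```

A check like text_matches_offsets can be useful after running the injection script, to confirm that the offsets line up with the modified raw texts rather than the unmodified LDC texts.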


Attributes for discourse relation tags only
DC_main: fine discourse relation sense represented by a main discourse connective.
sense: 4-way discourse relation sense category.
position: whether the discourse connective/alternative expression occurs at the initial, middle, or end position within the comma-separated segment (always initial for IMPLICIT/REDUNDANT).
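
As a rough illustration of how the aligned tags might be used, the following Python sketch counts how often each Chinese relation tag is aligned with each English relation tag (for example, a Chinese IMPLICIT relation aligned with an English EXPLICIT one). It assumes the same hypothetical record layout of one dictionary per tag with "tag" and "id" keys; the function name and the sample records are invented for illustration.

```python
from collections import Counter

def tag_pair_counts(zh_records, en_records):
    """Count (Chinese tag, English tag) pairs for relations sharing an id.

    Records without a counterpart on the English side (for example the
    Chinese-only comma tags) are skipped.
    """
    en_tag_by_id = {r["id"]: r["tag"] for r in en_records}
    return Counter(
        (zh["tag"], en_tag_by_id[zh["id"]])
        for zh in zh_records
        if zh["id"] in en_tag_by_id
    )

# Made-up records, not corpus data.
zh_sample = [
    {"tag": "IMPLICIT", "id": "1"},
    {"tag": "EXPLICIT", "id": "2"},
    {"tag": "ADVERBIAL", "id": "3"},  # Chinese side only, no English counterpart
]
en_sample = [
    {"tag": "EXPLICIT", "id": "1"},
    {"tag": "EXPLICIT", "id": "2"},
]

counts = tag_pair_counts(zh_sample, en_sample)
```

The same pattern could be applied to the sense or DC_main attributes instead of the tag name.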


download.v1.0

README