<< Crosslingual annotation of discourse connectives in a Chinese-English translation corpus >>

This dataset contains annotations and alignments of discourse connectives for 325 articles of the Chinese Treebank and their English translations.
The work is described in the following papers:
Frances Yung, Kevin Duh, and Yuji Matsumoto. 'Sequential Annotation and Chunking of Chinese Discourse Structure', SIGHAN 2015
Frances Yung, Kevin Duh, and Yuji Matsumoto. 'Crosslingual Annotation and Analysis of Implicit Discourse Connectives for Machine Translation', DiscoMT 2015

The annotations are presented as mappings onto the raw texts of the original data released by LDC.
However, the raw texts have been modified in places to make the annotation consistent.
(For example, some enumeration commas are replaced by normal commas.)
We include a short script, get_raw.py, that applies these modifications to the original LDC files and injects the resulting raw texts into the annotation files.

1. Content
	ann.zh: Chinese-side annotation
	ann.en: English-side annotation
	edits.zh: edit operations to modify Chinese raw texts
	edits.en: edit operations to modify English raw texts

2. get_raw.py
	The python-Levenshtein package is required.
	It can be downloaded from: https://pypi.python.org/pypi/python-Levenshtein/0.12.0

	Usage:
	python get_raw.py [path to LDC2005T01/DATA/RAW] [path to LDC2007T02/ectb_v1/data/rawtext-files]
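
The idea of shipping edit operations instead of the modified raw texts can be illustrated with a small sketch. It is not the code of get_raw.py, and the (start, end, replacement) tuples below are a hypothetical format, not the format of edits.zh or edits.en. It also uses difflib from the standard library rather than python-Levenshtein, so that it runs without extra packages. The function names make_edits and apply_edits are made up for this illustration.

```python
from difflib import SequenceMatcher


def make_edits(original, modified):
    """Record only the operations needed to turn `original` into `modified`.

    Each operation is a hypothetical (start, end, replacement) tuple:
    replace original[start:end] with `replacement`.
    """
    ops = []
    matcher = SequenceMatcher(None, original, modified, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Insertions have i1 == i2; deletions have an empty replacement.
            ops.append((i1, i2, modified[j1:j2]))
    return ops


def apply_edits(original, ops):
    """Replay the recorded operations against the original text."""
    pieces = []
    cursor = 0
    for start, end, replacement in ops:
        pieces.append(original[cursor:start])  # unchanged text before the edit
        pieces.append(replacement)             # edited text
        cursor = end
    pieces.append(original[cursor:])           # unchanged tail
    return "".join(pieces)


# Example from the note above: an enumeration comma replaced by a normal comma.
original = "苹果、香蕉、橘子"
modified = "苹果，香蕉、橘子"
ops = make_edits(original, modified)
restored = apply_edits(original, ops)
```

With this kind of scheme, only the operations need to be distributed; anyone holding the original LDC text can reconstruct the modified text locally.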