Web Corpora for Information Management
We investigate the use of web corpora with connectivity information (hyperlinks) for domain-specific information management applications. We will build a web corpus for the language technology domain, consisting of a database of documents (with a full-text index and meta-information) and a database of the hyperlinks between those documents. As the starting point for collecting the corpus, we use the database of categorised web pages from LT-World. The targeted information management applications include summarisation, categorisation, clustering, information extraction (discovery of relations), information retrieval, terminology extraction, and definition mining.
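The two-part corpus described above (a document database with a full-text index and meta-information, plus a hyperlink database) could be sketched roughly as follows. This is only an illustration under our own assumptions, not the project's actual schema: the table names, the single `category` meta-information field, the simple word-level inverted index, and the `example.org` URLs are all hypothetical.

```python
import re
import sqlite3


def create_corpus(conn):
    """Create hypothetical tables for documents, index postings, and links."""
    conn.executescript("""
        CREATE TABLE documents (
            doc_id   INTEGER PRIMARY KEY,
            url      TEXT UNIQUE NOT NULL,
            category TEXT,              -- assumed meta-information field
            text     TEXT NOT NULL
        );
        CREATE TABLE postings (         -- minimal full-text (inverted) index
            term   TEXT NOT NULL,
            doc_id INTEGER NOT NULL REFERENCES documents(doc_id),
            PRIMARY KEY (term, doc_id)
        );
        CREATE TABLE links (            -- hyperlinks between documents
            source_id INTEGER NOT NULL REFERENCES documents(doc_id),
            target_id INTEGER NOT NULL REFERENCES documents(doc_id),
            PRIMARY KEY (source_id, target_id)
        );
    """)


def add_document(conn, url, category, text):
    """Store a document and index its distinct lower-cased word tokens."""
    cur = conn.execute(
        "INSERT INTO documents (url, category, text) VALUES (?, ?, ?)",
        (url, category, text),
    )
    doc_id = cur.lastrowid
    terms = set(re.findall(r"\w+", text.lower()))
    conn.executemany(
        "INSERT INTO postings (term, doc_id) VALUES (?, ?)",
        [(term, doc_id) for term in terms],
    )
    return doc_id


def add_link(conn, source_id, target_id):
    """Record a hyperlink from one stored document to another."""
    conn.execute(
        "INSERT OR IGNORE INTO links (source_id, target_id) VALUES (?, ?)",
        (source_id, target_id),
    )


def search(conn, term):
    """Return the URLs of documents containing the term, sorted by URL."""
    rows = conn.execute(
        "SELECT d.url FROM postings p JOIN documents d ON d.doc_id = p.doc_id "
        "WHERE p.term = ? ORDER BY d.url",
        (term.lower(),),
    )
    return [row[0] for row in rows]


def inlinks(conn, doc_id):
    """Return the URLs of documents that link to the given document."""
    rows = conn.execute(
        "SELECT d.url FROM links l JOIN documents d ON d.doc_id = l.source_id "
        "WHERE l.target_id = ? ORDER BY d.url",
        (doc_id,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_corpus(conn)
    a = add_document(conn, "http://example.org/a", "parsing", "Statistical parsing tools")
    b = add_document(conn, "http://example.org/b", "corpora", "Parsing resources and corpora")
    add_link(conn, a, b)
    print(search(conn, "parsing"))
    print(inlinks(conn, b))
```

Keeping the links in their own table, separate from the document text, means that connectivity-based processing (for example, collecting all pages that point to a given page) can run without touching the full-text index, while text-based applications such as retrieval can ignore the link table entirely.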