Computational Linguistics & Phonetics, Fachrichtung 4.7, Universität des Saarlandes

NLP Tools

General

  • NLTK: the Natural Language Toolkit

  • WEKA: easy-to-use toolkit for experimenting with different machine learning algorithms

  • CoNLL shared task data: annotated data sets for a number of NLP tasks in a number of languages

Web Crawler

Information Retrieval

Language Identification

Pre-processing

(Sentence Splitters, Tokenisers, POS Taggers, Lemmatisers, Morphological Analysers)
  • MXTERMINATOR (via FTP): statistical sentence splitter, pre-trained for English

  • Porter Stemmer: rule-based stemming algorithm for English, with implementations in numerous programming languages; adaptations for languages other than English are also available

  • Snowball stemmer: an extension of the Porter stemmer that can easily be ported to languages other than English. Implementations for other languages are available from the website.

  • TreeTagger: widely used tagger with pre-trained models for a number of languages (English, German, Italian, Dutch, Spanish, Bulgarian, Russian, French and Old French)

  • TnT: another well-known statistical tagger, trained models for English and German

  • Stanford POS Tagger: statistical tagger with pre-trained models for English, German, Chinese and Arabic

  • OpenNLP: toolbox for a variety of NLP tasks, with pre-trained models for sentence splitting, tokenising and POS tagging for English, German, Spanish and Thai

  • morph(a|g): morphological analyser and generator for English, distributed as part of the RASP system.

  • Morphy: morphological analyser, generator and POS tagger for German (unfortunately Windows only)

  • Morphisto: morphological analyser and generator for German.

  • Morfette: supervised learning of inflectional morphology. Pre-trained models for Spanish and French.
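
To illustrate what "rule-based stemming" means for the Porter and Snowball entries above, here is a minimal sketch in Python. It is a toy suffix stripper and not the Porter algorithm: the rule list, the function name `toy_stem` and the `min_stem_len` parameter are assumptions made up for this illustration. For real work, use one of the implementations linked above.

```python
# Toy rule-based suffix stripper. This is NOT the Porter algorithm;
# the rules below are a hypothetical, heavily simplified illustration.
SUFFIX_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (protects "ss" from the "s" rule)
    ("ing", ""),     # walking  -> walk
    ("ed", ""),      # walked   -> walk
    ("s", ""),       # cats     -> cat
]


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Apply the first matching rule that leaves at least min_stem_len characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= min_stem_len:
                return stem
    # No rule applied (or every candidate stem was too short).
    return word


if __name__ == "__main__":
    for w in ["caresses", "ponies", "walking", "walked", "cats", "sing"]:
        print(w, "->", toy_stem(w))
```

The minimum stem length keeps short words such as "sing" intact. The real Porter stemmer uses a more refined condition (the "measure" of the stem) and several further rule steps.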

Syntactic Analysis, Parsers

  • RASP: statistical parser for English, preprocessing modules (tokeniser, tagger etc.) included

  • Stanford Parser: statistical parser with pre-trained models for English, German, Arabic

  • MaltParser: well-known dependency parser, pre-trained models for English, Swedish and Chinese

  • C&C Tools: package with NLP tools, including a CCG (Combinatory Categorial Grammar) parser for English (including pre-processing modules: POS Tagging, NER, Chunking, SuperTagging)

  • Berkeley Parser: statistical parser with pre-trained models for English, German, French and Chinese

  • BitPar: PCFG parser for German (English grammar also available)

  • CDG Parser: Dependency parser for German

Text Mining / Information Extraction

(Named Entity Taggers, Co-reference and Pronoun Resolution)

Semantic Analysis

  • WordNet::Similarity: Perl module that implements a number of WordNet-based semantic relatedness measures (for English). Note: semantic relatedness can also be computed from co-occurrence frequencies in unannotated corpora (distributional/vector space models). Tools for this are also available on the web, but most are somewhat less user-friendly.

  • WordNet::SenseRelate: Perl module for performing (WordNet-based) word sense disambiguation (for English)

  • Shalmaneser: semantic role labeller for FrameNet style semantic argument structure, pre-trained models for English and German

  • SEMAFOR: frame-semantic parser for English.

  • ASSERT: semantic role labeller for PropBank style semantic argument structure, pre-trained for English

  • SPADE: discourse parser for RST-based analyses, for English

  • Boxer: semantic analyser that produces DRSs (Discourse Representation Structures, see DRT in Wikipedia); included in the C&C package
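
The note under WordNet::Similarity mentions that relatedness can also be computed from co-occurrence frequencies in unannotated text. A minimal sketch of that idea in Python follows, using only the standard library. The three-sentence corpus, the window size and the function names are assumptions chosen for illustration. Real distributional models use far larger corpora and usually some form of weighting (e.g. PMI) instead of raw counts.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, already tokenised.
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the car needs fuel".split(),
]

vectors = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_car = cosine(vectors["cat"], vectors["car"])
print(f"cat/dog: {sim_cat_dog:.3f}  cat/car: {sim_cat_car:.3f}")
```

In this toy corpus "cat" and "dog" share two context words ("the", "drinks"), while "cat" and "car" share only "the", so the first pair receives the higher similarity score.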

Webpages with further information on NLP resources