NLP Tools
General
- NLTK: the Natural Language Toolkit
- WEKA: easy-to-use toolkit for experimenting with different machine learning algorithms
- CoNLL shared task data: annotated data sets for a number of NLP tasks in a number of languages
Web Crawler
Information Retrieval
Language Identification
Pre-processing
(Sentence Splitters, Tokenisers, POS Taggers, Lemmatisers, Morphological Analysers)
- MXTERMINATOR (via FTP): statistical sentence splitter, pre-trained for English
- Porter Stemmer: rule-based stemming algorithm for English, with implementations in numerous programming languages; adaptations for languages other than English are also available
- Snowball stemmer: an extension of the Porter stemmer that can easily be ported to languages other than English. Implementations for other languages are available from the website.
- TreeTagger: widely used tagger with pre-trained models for a number of languages (English, German, Italian, Dutch, Spanish, Bulgarian, Russian, French and Old French)
- TnT: another well-known statistical tagger, trained models for English and German
- Stanford POS Tagger: statistical tagger with pre-trained models for English, German, Chinese and Arabic
- OpenNLP: toolbox for a variety of NLP tasks, with pre-trained models for sentence splitting, tokenising and POS tagging for English, German, Spanish, and Thai
- morph(a|g): morphological analyser and generator for English, distributed as part of the RASP system.
- Morphy: morphological analyser, generator and POS tagger for German (unfortunately for Windows only)
- Morphisto: morphological analyser and generator for German.
- Morfette: supervised learning of inflectional morphology. Pre-trained models for Spanish and French.
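To give a feel for what "rule-based stemming" means in the entries above, here is a minimal sketch of suffix stripping in Python. It is a toy illustration only: the rule list, the function name `toy_stem` and the `min_stem_len` parameter are assumptions made for this example, and the code is not the Porter or Snowball algorithm, both of which use many more rules and conditions.

```python
# Toy suffix-stripping rules, tried in order; the first applicable rule wins.
# This is NOT the Porter algorithm, only an illustration of the general idea.
SUFFIX_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (protects "ss" from the "s" rule)
    ("ing", ""),     # parsing  -> pars
    ("ed", ""),      # tagged   -> tagg
    ("s", ""),       # taggers  -> tagger
]


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip one suffix from `word` if the remaining stem is long enough."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            # Skip rules that would leave an implausibly short stem.
            if len(stem) >= min_stem_len:
                return stem
    return word


stems = [toy_stem(w) for w in ["caresses", "ponies", "parsing", "taggers", "sing"]]
```

For real work, one of the maintained stemmer implementations listed above is the better choice; the sketch only shows why stems such as "poni" need not be dictionary words.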
Syntactic Analysis, Parsers
- RASP: statistical parser for English, preprocessing modules (tokeniser, tagger etc.) included
- Stanford Parser: statistical parser with pre-trained models for English, German, Arabic
- MaltParser: well-known dependency parser, pre-trained models for English, Swedish and Chinese
- C&C Tools: package with NLP tools, including a CCG (Combinatory Categorial Grammar) parser for English (including pre-processing modules: POS Tagging, NER, Chunking, SuperTagging)
- Berkeley Parser: statistical parser with pre-trained models for English, German, French and Chinese
- BitPar: PCFG parser for German (English grammar also available)
- CDG Parser: Dependency parser for German
Text Mining / Information Extraction
(Named Entity Taggers, Co-reference and Pronoun Resolution)
- Reconcile: state-of-the-art, out-of-the-box co-reference resolution for English
- Illinois Coreference Package: co-reference resolution for English (Java package), pre-trained models available
- BART: co-reference resolution for English (pronouns and NPs); for more details see Versley et al. (2008): BART: A Modular Toolkit for Coreference Resolution, in Proc. ACL-HLT 2008 Demo Session.
- JavaRAP: Java implementation of a well-known rule-based anaphora resolution algorithm (for English). This algorithm has been adapted for German; for details see Holger Wunsch: Anaphora Resolution -- What Helps in German. In: Pre-Proceedings of the International Conference on Linguistic Evidence 2006, Tübingen, Germany, February 2-4, 2006.
- HeidelTime: temporal tagger (English, German)
- OpenNLP: pre-trained models for NE tagging and co-reference resolution for English
- Stanford Named Entity Recogniser: pre-trained models for English
- German Named Entity Recogniser: based on Stanford NER, developed by Manaal Faruqui and Sebastian Pado
- SemiNER - Semisupervised Named Entity Recognizer: developed by Grzegorz Chrupała and Dietrich Klakow, pre-trained models for German
- C&C NE Tagger: named entity tagger in the C&C package, pre-trained for English
- Timex taggers: tools for tagging temporal expressions, for English
- Pattern: Text/Web Mining Module for Python, developed by Tom De Smedt, CLiPS, University of Antwerp
Semantic Analysis
- WordNet::Similarity: Perl module that implements a number of WordNet-based semantic relatedness measures (for English). Note: semantic relatedness can also be computed from co-occurrence frequencies in unannotated corpora (distributional/vector space models). Tools for this are also available on the web, but most are somewhat less user-friendly.
- WordNet::SenseRelate: Perl module for performing (WordNet-based) word sense disambiguation (for English)
- Shalmaneser: semantic role labeller for FrameNet style semantic argument structure, pre-trained models for English and German
- SEMAFOR: frame-semantic parser for English.
- ASSERT: semantic role labeller for PropBank style semantic argument structure, pre-trained for English
- SPADE: discourse parser for RST-based analyses, for English
- Boxer: semantic analyses, produces DRSs (Discourse Representation Structures, see DRT in Wikipedia), included in the C&C package
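The note under WordNet::Similarity mentions computing semantic relatedness from co-occurrence frequencies in unannotated corpora. The following is a minimal sketch of that idea in Python, assuming raw co-occurrence counts within a fixed window and cosine similarity as the relatedness measure. The function names, the window size and the three-sentence corpus are made up for illustration; real distributional models use far larger corpora and usually some weighting scheme.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[target][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, already tokenised.
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the car needs fuel".split(),
]
vectors = cooccurrence_vectors(corpus)
cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_car = cosine(vectors["cat"], vectors["car"])
```

In this toy corpus "cat" and "dog" share more context words than "cat" and "car", so `cat_dog` comes out higher than `cat_car`.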
Webpages with further information on NLP resources