DAGAN

Language and Computation
LEXICAL STATISTICAL METHODS FOR NATURAL LANGUAGE PROCESSING

Advanced course

IDO DAGAN

Dept. of Mathematics and Computer Science, Bar Ilan University

First week
dagan@cs.biu.ac.il

Course description

Statistical information at the lexical level has been found useful for many applications in natural language processing (NLP). The common types of information used are frequencies of words (lemmas, terms) in corpora and documents, and co-occurrence frequencies of words with other words or with keyword categories.
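
As a rough illustration of the two kinds of counts mentioned above, the following minimal Python sketch computes word frequencies and document-level co-occurrence frequencies. The toy corpus, the whitespace tokenization, the choice of the whole document as the co-occurrence window, and the function names are all assumptions made for this example; they are not taken from the course material.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; each string stands in for one document.
corpus = [
    "the bank raised the interest rate",
    "the river bank was muddy",
    "interest in the river walk grew",
]

def word_frequencies(docs):
    """Count how often each word occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())  # naive whitespace tokenization (assumption)
    return counts

def cooccurrence_frequencies(docs):
    """Count, for each unordered word pair, the number of documents containing both words."""
    counts = Counter()
    for doc in docs:
        vocab = sorted(set(doc.split()))  # sorted so each pair has one canonical order
        counts.update(combinations(vocab, 2))
    return counts

freqs = word_frequencies(corpus)
cooc = cooccurrence_frequencies(corpus)

print(freqs["bank"])          # occurrences of "bank" in the corpus
print(cooc[("bank", "the")])  # documents in which "bank" and "the" both appear
```

Real applications would typically count lemmas or terms rather than raw tokens, and might use a narrower window (a sentence or a fixed number of words) or a syntactic relation instead of the whole document.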

These statistics are used for supervised classification problems (word sense disambiguation, target word selection, PP-attachment, spelling correction, text categorization, language modeling), for unsupervised applications (term extraction, statistical word similarity, word and document clustering), and for combinations of the two (similarity-based and cluster-based methods for language modeling and disambiguation).

The course will try to provide a unifying picture of the methods that are applied to lexical statistics in the different applications. We will emphasize common aspects of different applications and common goals and structures of various computational methods. The course will cover statistical and computational issues such as testing statistical significance, smoothing, information-theoretic measures, Bayesian inference, vector-based representations and metrics, and simple neural-network-style techniques (a preliminary list). We will also show how statistical information can be beneficially combined with symbolic information, such as the output of a grammatical parser.

The course will be at an intermediate level. It will describe basic versions of the methods above, focusing on the rationale behind them rather than on very deep technicalities.

While not covering all areas of statistical language processing, the course can serve as a good introduction to the approach taken by statistical NLP methods. It will also be useful for students who are familiar with some parts of statistical NLP and would like to get a global view of lexical statistical methods and their applications.

Prerequisites
The course should be comprehensible for students with a basic understanding of probability theory or statistics, and of computation (at the level of an introductory course in each discipline, or equivalent).

Literature
No specific recommendation
