Language and Computation
Advanced course


Dept. of Mathematics and Computer Science, Bar Ilan University

First week
Course description

Statistical information at the lexical level has proved useful for many applications in natural language processing (NLP). The most common types of information are the frequencies of words (lemmas, terms) in corpora and documents, and the co-occurrence frequencies of words with other words or with keyword categories.
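
As a rough illustration of these two kinds of statistics, the following Python sketch counts word frequencies and windowed co-occurrence frequencies over a toy corpus. The function name lexical_counts, the window size of 2, and the two example sentences are assumptions made for illustration only; they are not taken from the course material.

```python
from collections import Counter


def lexical_counts(sentences, window=2):
    """Count word frequencies and co-occurrence frequencies.

    `sentences` is a list of token lists. Two words co-occur when they
    appear within `window` positions of each other in the same sentence.
    Pairs are stored in sorted order, so they are unordered.
    """
    word_freq = Counter()
    cooc_freq = Counter()
    for tokens in sentences:
        word_freq.update(tokens)
        for i, w in enumerate(tokens):
            # Pair w with each of the next `window` tokens.
            for v in tokens[i + 1 : i + 1 + window]:
                cooc_freq[tuple(sorted((w, v)))] += 1
    return word_freq, cooc_freq


# Hypothetical toy corpus, already tokenized.
corpus = [
    "the bank raised interest rates".split(),
    "the river bank was muddy".split(),
]
word_freq, cooc_freq = lexical_counts(corpus)
```

Here word_freq["bank"] gives the corpus frequency of "bank", and cooc_freq[("bank", "river")] gives how often the two words fall within the same window. Real applications would typically work with lemmas and much larger corpora.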

These statistics are used for supervised classification problems (word sense disambiguation, target word selection, PP-attachment, spelling correction, text categorization, language modeling), for unsupervised applications (term extraction, statistical word similarity, word and document clustering), and for combinations of the two (similarity-based and cluster-based methods for language modeling and disambiguation).

The course aims to provide a unifying picture of the methods that are applied to lexical statistics in the different applications. We will emphasize the aspects that different applications share, and the common goals and structures of the various computational methods. The course will cover statistical and computational issues such as testing statistical significance, smoothing, information-theoretic measures, Bayesian inference, vector-based representations and metrics, and simple neural-network-style techniques (a preliminary list). We will also show how statistical information can be beneficially combined with symbolic information, such as the output of a grammatical parser.

The course will be at an intermediate level. It will describe basic versions of the methods above, focusing on the rationale behind them rather than on very deep technicalities.

While it does not cover all areas of statistical language processing, the course can serve as a good introduction to the approach taken by statistical NLP methods. It will also be useful for students who are familiar with some parts of statistical NLP and would like to get a global view of lexical statistical methods and their applications.

The course should be comprehensible to students with a basic understanding of probability theory or statistics, and of computation (at the level of an introductory course in each discipline, or equivalent).

Literature

No specific recommendation.