Dept. of Mathematics and Computer Science, Bar Ilan University

Statistical information at the lexical level has been found useful
for many applications in natural language processing (NLP). The
common types of information used are frequencies of words (lemmas,
terms) in corpora and documents, and co-occurrence frequencies
of words with other words or with keyword categories.

These statistics are used for supervised classification problems
(word sense disambiguation, target word selection, PP-attachment,
spelling correction, text categorization, language modeling),
for unsupervised applications (term extraction, statistical word
similarity, word and document clustering), and for combinations
of the two (similarity- and cluster-based methods for language
modeling and disambiguation).

The course will try to provide a unifying picture of the methods
which are applied to lexical statistics in the different applications.
We will emphasize common aspects of different applications and
common goals and structures of various computational methods.
The course will cover statistical and computational issues such
as testing statistical significance, smoothing, information-theoretic
measures, Bayesian inference, vector-based representations and
metrics, and simple neural-network-style techniques (a preliminary
list).

We will also show how statistical information can be beneficially
combined with symbolic information, such as the output of a grammatical
parser.

The course will be at an intermediate level. It will describe
basic versions of the methods above, focusing on the rationale
behind them rather than on very deep technicalities.

While not covering all areas of statistical language processing,
the course can serve as a good introduction to the approach taken
by statistical NLP methods. It will also be useful for students
familiar with some parts of statistical NLP who would like to
get a global view of lexical statistical methods and their applications.
LEXICAL STATISTICAL METHODS FOR NATURAL LANGUAGE PROCESSING
dagan@cs.biu.ac.il
The course should be comprehensible to students with a basic understanding
of probability theory or statistics, and of computation (at the level
of an introductory course in each discipline, or equivalent).
Literature
No specific recommendation