International Research Training Group
Language Technology
&
Cognitive Systems
Saarland University & University of Edinburgh

Learning under Differing Training and Test Distributions

Speaker: Tobias Scheffer

Institution: MPI für Informatik

Abstract:

Most learning algorithms are constructed under the assumption that the training data is governed by exactly the same distribution to which the model will later be exposed. In practice, control over the data generation process is often imperfect. Training data may consist of a benchmark corpus (e.g., the Penn Treebank) that does not reflect the distribution of sentences on which a parser will later be used. Spam filters may be used by individuals whose distribution of inbound emails diverges from the distribution reflected in public training corpora (e.g., the TREC spam corpus).

In the talk, I will analyze the problem of learning classifiers that perform well under a test distribution that may differ arbitrarily from the training distribution. I will discuss the correct optimization criterion and solutions to this problem. Empirically, it is quite common for future test data to be governed by a distribution different from that of the training data recorded in the past. Whenever this is the case, naively training on the available training data is not the best one can do.
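As a rough illustration of the kind of correction such an analysis can lead to (not necessarily the method presented in the talk), the following Python sketch handles a shifted input distribution by importance weighting: a probabilistic domain classifier estimates the density ratio between test and training inputs, and the final classifier is trained on the correspondingly reweighted loss. The synthetic data, model choices, and library calls are illustrative assumptions, not taken from the abstract.

    # Sketch of one standard remedy for differing training and test distributions:
    # importance weighting under a covariate-shift assumption. Purely illustrative;
    # all data below is synthetic.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Training inputs from one distribution, test inputs from a shifted one.
    X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
    y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
    X_test = rng.normal(loc=1.0, scale=1.0, size=(500, 2))  # shifted mean

    # Step 1: estimate the density ratio p_test(x) / p_train(x) with a
    # probabilistic classifier that separates test inputs (label 1) from
    # training inputs (label 0).
    domain_X = np.vstack([X_train, X_test])
    domain_y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    domain_clf = LogisticRegression().fit(domain_X, domain_y)
    p_test_given_x = domain_clf.predict_proba(X_train)[:, 1]
    weights = p_test_given_x / (1.0 - p_test_given_x)  # odds approximate the ratio

    # Step 2: minimize the weighted training loss, which approximates the
    # expected loss under the test distribution.
    clf = LogisticRegression()
    clf.fit(X_train, y_train, sample_weight=weights)

Training with these weights emphasizes training examples that resemble the test inputs; whether this particular reweighting scheme is among the solutions discussed in the talk is not stated in the abstract.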
