Computational Linguistics & Phonetics, Fachrichtung 4.7, Universität des Saarlandes

Computational Linguistics Colloquium

Thursday, 10 May 2012, 16:15
Seminar Room, Building C7.4

Note the unusual place!

Distributional semantics with eyes: enriching corpus-based models of word meaning with automatically extracted visual features

Marco Baroni
Language, Interaction and Computation Laboratory
Center for Mind/Brain Sciences
University of Trento

The last few decades have seen great progress in the automated extraction of semantic information from large collections of text (corpora). Successful methods for tasks such as synonym detection or determining the selectional preferences of verbs rely on some version of the distributional hypothesis, that is, the idea that the meaning of a word can be approximated by the set of linguistic contexts in which it occurs. On closer inspection, the distributional hypothesis makes two separate claims: 1) that meaning is approximated by context, and 2) that we can limit ourselves to linguistic contexts. The latter restriction has probably been adopted more out of necessity than out of theoretical conviction: it is easy to extract from corpora the linguistic contexts in which a word occurs, whereas, until recently, it was not clear how other kinds of contextual information could be harvested on a large scale.
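To make the distributional hypothesis concrete, here is a minimal sketch (not from the talk; the toy corpus and window size are invented for illustration): each word is represented by the counts of words occurring within a small window around it, and words are then compared by cosine similarity of these context vectors.

```python
# Toy illustration of the distributional hypothesis: words that occur in
# similar linguistic contexts ("dog"/"puppy") end up with similar vectors.
from collections import Counter, defaultdict
import math

corpus = [
    "the dog chased the cat".split(),
    "the puppy chased the cat".split(),
    "the dog ate the bone".split(),
    "the puppy ate the bone".split(),
]

window = 2  # count context words within +/-2 positions
vectors = defaultdict(Counter)
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                vectors[word][sentence[j]] += 1

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in set(u) & set(v))
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

# "dog" and "puppy" occur in identical contexts, so their similarity
# is maximal; "dog" and "bone" share only function-word contexts.
print(cosine(vectors["dog"], vectors["puppy"]))
print(cosine(vectors["dog"], vectors["bone"]))
```

Real models differ in context definition, weighting (e.g. PMI instead of raw counts), and dimensionality reduction, but the core idea is the one sketched here.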

But this has changed: thanks to the Web, we now have access to huge amounts of multimodal documents in which words co-occur with images (tagged Flickr pictures, illustrated news stories, YouTube videos...). And thanks to progress in computer vision, we can represent images in terms of automatically extracted discrete features, which can in turn be treated as visual collocates of the words associated with the images, enriching the distributional representation of words with visual information. In this talk, I will introduce our ongoing work on building multimodal distributional semantic models by combining textual and visual collocates, and report the results of various experiments showing that visual information nicely complements text-derived linguistic information, leading to more "grounded" distributional models of semantics that may be better equipped to simulate human-like semantic behaviour.
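One simple way to combine the two kinds of collocates, sketched below under assumptions of my own (the feature names, counts, and the weighted-concatenation scheme are invented for illustration and are not necessarily the method used in the talk), is to normalize the textual and visual count vectors for a word separately and then merge them into one multimodal vector, keeping the visual features in their own namespace.

```python
# Hedged sketch: fuse a text-based and an image-based count vector for one
# word by length-normalizing each modality and taking a weighted union.
import math

def normalize(vec):
    """Scale a sparse vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else dict(vec)

# Hypothetical textual collocates of "moon" (invented counts).
text_vec = {"sky": 12, "night": 9, "orbit": 4}
# Hypothetical visual collocates: discrete features extracted from images
# tagged with "moon" (invented; "vis:" prefix keeps the spaces disjoint).
visual_vec = {"vis:round_shape": 7, "vis:dark_background": 11}

def combine(text, visual, alpha=0.5):
    """Weighted concatenation of normalized modality vectors."""
    t, v = normalize(text), normalize(visual)
    merged = {k: alpha * x for k, x in t.items()}
    merged.update({k: (1 - alpha) * x for k, x in v.items()})
    return merged

multimodal = combine(text_vec, visual_vec)
```

The resulting vector can be fed into the same similarity computations used for purely textual models; the weight `alpha` controls how much each modality contributes.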

If you would like to meet with the speaker, please contact Manfred Pinkal.