Computational Linguistics Colloquium

Distinguished Speakers in Language Science

Wednesday, 26 November 2014, 16:15
Conference Room, Building C7.4
Unusual date!

How Humans (and Machines) Integrate Language and Vision: Image Description as a Test Case

Frank Keller
School of Informatics
University of Edinburgh

Joint work with Moreno Coco and Des Elliott

Humans often process text or speech in a visual context, e.g., when listening to a lecture, reading a map, or describing an image. Here, we focus on image description as an example of language/vision integration. Previous research has shown that objects in a visual scene are fixated before they are mentioned, leading us to hypothesize that the scan pattern of a participant can be used to predict what they will say. We test this hypothesis using a data set of cued scene descriptions of photo-realistic scenes. We demonstrate that similar scan patterns are correlated with similar sentences and that this correlation holds for three phases of language production (target identification, sentence planning, and speaking). We go on to show how insights from human language/vision integration can be used to build systems that automatically describe images. We propose a novel way of representing images as visual dependency graphs, where arcs between image regions are labeled with spatial relationships. The task of relating image regions to each other can then be viewed as a parsing task. We show how image parsing can be automated and how the output of an image parser can be used to generate image descriptions. The resulting system outperforms standard approaches that rely on object proximity or corpus information to generate descriptions.

If you would like to meet with the speaker, please contact Vera Demberg.