Computational Linguistics & Phonetics, Fachrichtung 4.7, Universität des Saarlandes

Computational Linguistics Colloquium

Thursday, 3 July, 16:15
Conference Room, Building C7 4

Multimodal Statistical Learning: Linking Words to World

Chen Yu
Indiana University

There are an infinite number of possible word-to-world pairings in naturalistic learning environments. Previous proposals to solve this mapping problem focus on linguistic, social, and representational constraints at a single moment. We examined an alternative account -- a cross-situational learning strategy based on computing distributional statistics across words, across referents, and most importantly across the co-occurrences of the two at multiple moments. We briefly exposed human learners to a set of trials that each contained multiple spoken words and multiple pictures of individual objects; no information about word-picture correspondences was given within a trial. Nonetheless, over trials, subjects learned the word-picture mappings through cross-trial statistical relations. The remarkable performance of adult and child learners in various learning conditions suggests that they compute cross-trial statistics with sufficient fidelity and, by doing so, rapidly learn word-referent pairs even in highly ambiguous learning contexts.
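The cross-situational strategy described above can be sketched as a simple co-occurrence counter. This is a minimal illustration under stated assumptions, not the speaker's actual model: the function name, the example words, and the referent labels are all hypothetical.

```python
from collections import defaultdict

def cross_situational_learner(trials):
    """Accumulate word-referent co-occurrence counts across trials and
    map each word to its most frequently co-occurring referent."""
    counts = defaultdict(lambda: defaultdict(int))
    for words, referents in trials:
        # Within a trial, every word co-occurs with every visible referent;
        # no within-trial pairing information is available to the learner.
        for w in words:
            for r in referents:
                counts[w][r] += 1
    # Across trials, the true referent co-occurs with its word most often.
    return {w: max(refs, key=refs.get) for w, refs in counts.items()}

# Three individually ambiguous trials; only cross-trial statistics
# disambiguate which word goes with which referent.
trials = [
    (["ball", "dog"], ["BALL", "DOG"]),
    (["ball", "cup"], ["BALL", "CUP"]),
    (["dog", "cup"], ["DOG", "CUP"]),
]
print(cross_situational_learner(trials))
# → {'ball': 'BALL', 'dog': 'DOG', 'cup': 'CUP'}
```

Within any single trial above, each word is equally consistent with two referents; the correct mapping emerges only because the true pair co-occurs on more trials than any spurious pair.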
Moreover, we suggest that another important aspect of word learning is an understanding of the mechanisms through which word learning is grounded in sensorimotor experience, in the physical regularities of the world, and in the time-locked and coupled multimodal interactions between the child's own actions and the actions of their caregivers. We designed and implemented a novel multimodal sensing environment consisting of two head-mounted mini cameras placed on the child's and the parent's foreheads, motion tracking of head movements, and recording of the caregiver's speech. Using this new technology, we captured the dynamic visual information from both the learner's perspective and the parent's viewpoint while they were engaged in a naturalistic toy-naming interaction, in order to study the regularities and dynamic structure in the multimodal learning environment. Our results show that a wide range of perceptual and motor patterns, such as the proportion of the named objects in both the child's and the caregiver's visual fields, the proportion of time that the child's hands are holding the named objects when those names are uttered, as well as the child's head movements, are predictive of successful word learning through social interaction.

If you would like to meet with the speaker, please contact Berry Claus.