A Computational Model of Target Oriented Production of Prosody

(short title: Production of Prosody)

DFG Grant to G. Dogil & B. Möbius: 1 Oct 2001 - 30 Sep 2004

Project description

The two main goals of the proposed project are, first, to establish a novel paradigm of research into the production of prosody and, second, to provide experimental evidence for several assumptions made by the computational model underlying the proposed approach. Our approach is inspired by the speech production model recently proposed by Frank Guenther, Joe Perkell, and colleagues. Their model posits that speech production is constrained by auditory and perceptual requirements. The only invariant targets of the speech production process are auditory perceptual targets. The targets are characterized as multidimensional regions in the perceptual space, and speech movements are trajectories planned to traverse the target regions. Our project rests on the assumption that these statements hold for the production of prosody as well.
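
The notion of a trajectory traversing a target region can be made concrete with a small sketch. It is an illustration under simplifying assumptions, not the Guenther and Perkell model itself: the target region is taken to be an axis-aligned box, the perceptual space is two-dimensional with unnamed, hypothetical dimensions, and a trajectory counts as traversing the region if at least one of its sampled points falls inside it. The function name `traverses` and all numeric values are invented for the example.

```python
import numpy as np

def traverses(trajectory, lower, upper):
    """Return True if at least one sampled point of the trajectory
    lies inside the box-shaped target region [lower, upper]."""
    points = np.asarray(trajectory, dtype=float)
    # A point is inside the region only if every dimension is within bounds.
    inside = np.all((points >= lower) & (points <= upper), axis=1)
    return bool(inside.any())

# Hypothetical target region in a two-dimensional perceptual space.
lower = np.array([4.0, 0.3])
upper = np.array([8.0, 0.7])

# Two hypothetical trajectories, each sampled at three points in time.
reaching = [(1.0, 0.1), (5.0, 0.5), (2.0, 0.9)]  # middle sample is inside
missing = [(1.0, 0.1), (3.0, 0.5), (2.0, 0.9)]   # never inside the region

print(traverses(reaching, lower, upper))
print(traverses(missing, lower, upper))
```

Because the target is a region rather than a point, many different trajectories satisfy it, and it is this freedom that lets the realization vary with context.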

Within the proposed research project, we will implement a computational prosody model that is motivated both by the theory of speech production and by linguistic theory. The model is intended to serve two main purposes. First, it will allow us to test empirically a number of assumptions made by the production model, for instance the effect of speaking rate and other timing-related factors on the acoustic realization of intonational gestures. Second, the linguistically based classification of intonational events, e.g. those related to (a) discourse structure (register, pitch range), (b) information structure (topic, focus), and (c) accentual patterns (pitch accents, tones, tunes), can be tested experimentally with trainable intonation event classifiers. A neural network architecture can learn mappings between reference frames (the perceptual target regions) and speech/intonational gestures. In accordance with the postural relaxation hypothesis, we expect such a learned neural mapping, based on adaptive weights, to tend consistently towards comfortable realization configurations, whether under the influence of temporal constraints or as part of a tune (a coherent sequence of intonational events), as long as the perceptual target region is traversed.
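
The expected tendency towards comfortable configurations can be illustrated with a deliberately reduced sketch. It does not implement a neural mapping. It assumes, purely for illustration, that the target region is an axis-aligned box, that a single "neutral" (most comfortable) configuration can be expressed in the same space as the target, and that comfort decreases with Euclidean distance from that neutral point. Under these assumptions the most relaxed realization that still meets the target is the point of the box closest to the neutral configuration, which is obtained by clipping. The function name `relaxed_realization` and all values are hypothetical.

```python
import numpy as np

def relaxed_realization(neutral, lower, upper):
    """Return the point of the box [lower, upper] that is closest
    (in Euclidean distance) to the neutral configuration."""
    # For an axis-aligned box, the nearest point is found by clipping
    # each coordinate independently to the bounds of the box.
    return np.clip(np.asarray(neutral, dtype=float), lower, upper)

# Hypothetical target region and neutral configurations.
lower = np.array([4.0, 0.3])
upper = np.array([8.0, 0.7])

outside = relaxed_realization([2.0, 0.5], lower, upper)
inside = relaxed_realization([5.0, 0.5], lower, upper)

print(outside)  # moved only as far as the nearest edge of the region
print(inside)   # already inside the region, so left unchanged
```

In this simplified picture the realization moves only as far as the edge of the target region and no further, which is one way to read the prediction that realizations relax as long as the perceptual target is still traversed.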