A Computational Model of Target Oriented Production of
Prosody
(short title: Production of Prosody)
DFG Grant to G. Dogil & B. Möbius: 1 Oct 2001 - 30 Sep 2004
Project description
The two main goals of the proposed project are, first, to establish a
novel paradigm of research into the production of prosody and, second,
to provide experimental evidence for several assumptions made by the
computational model underlying the proposed approach. Our approach is
inspired by the speech production model recently proposed by Frank
Guenther, Joe Perkell, and colleagues. Their model posits that speech
production is constrained by auditory and perceptual requirements. The
only invariant targets of the speech production process are auditory
perceptual targets. The targets are characterized as multidimensional
regions in the perceptual space, and speech movements are trajectories
planned to traverse the target regions. Our project rests on the
assumption that these statements hold for the production of prosody as
well.
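The notion of a target as a region rather than a point can be made
concrete with a small sketch. It is an illustration only, not part of
the Guenther/Perkell model or of the project's implementation: the
region is simplified to an axis-aligned box, the two perceptual
dimensions (pitch in semitones, time in ms) and all numbers are
hypothetical, and a trajectory counts as traversing the region if at
least one of its sampled points falls inside it.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Point = Tuple[float, ...]


@dataclass(frozen=True)
class TargetRegion:
    """Axis-aligned box in a perceptual space (an illustrative simplification)."""
    lower: Point
    upper: Point

    def contains(self, point: Point) -> bool:
        # A point is inside if every coordinate lies within its bounds.
        return all(lo <= x <= hi
                   for lo, x, hi in zip(self.lower, point, self.upper))


def traverses(trajectory: Sequence[Point], region: TargetRegion) -> bool:
    """True if at least one sampled trajectory point lies inside the region."""
    return any(region.contains(p) for p in trajectory)


# Hypothetical 2-D space: (pitch in semitones above baseline, time in ms).
region = TargetRegion(lower=(4.0, 100.0), upper=(7.0, 180.0))
hit = [(1.0, 0.0), (3.5, 80.0), (5.2, 140.0), (2.0, 220.0)]
miss = [(1.0, 0.0), (2.5, 80.0), (3.0, 140.0), (2.0, 220.0)]

print(traverses(hit, region))   # the third sample lies inside the box
print(traverses(miss, region))  # no sample reaches the required pitch
```

Both trajectories have the same timing; only the first reaches the
target region, so only the first would count as a valid realization
under this simplified reading.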
Within the proposed research project, we will implement a
computational prosody model that is motivated both by the theory of
speech production and by linguistic theory. The
computational model is intended to serve two main purposes. First, it
will allow us to empirically test a number of assumptions made by the
production model, for instance the effect of speaking rate and other
factors related to speech timing on the acoustic realization of
intonational gestures. Second, the linguistically based classification
of intonational events, e.g. those related to (a) discourse structure
(register, pitch range), (b) information structure (topic, focus), and
(c) accentual patterns (pitch accents, tones, tunes), can be
experimentally tested by trainable intonation event classifiers. A
neural network architecture can learn mappings between reference
frames (the perceptual target regions) and speech/intonational
gestures. In accordance with the postural relaxation hypothesis, we
expect such a learned neural mapping, based on adaptive weights, to
tend consistently towards comfortable realization configurations,
whether under the influence of temporal constraints or as part of a
tune (a coherent sequence of intonational events), as long as the
perceptual target region is still traversed.
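The expected tendency towards comfortable configurations can be
sketched geometrically. This is a stand-in for the learned neural
mapping, not an implementation of it: it assumes that realizations and
perceptual targets are expressed in the same coordinates (which the
production model does not assume), that the target region is an
axis-aligned box, and that a single hypothetical "neutral"
configuration represents maximal comfort. Under these assumptions the
most relaxed admissible realization is the point of the region closest
to the neutral configuration, obtained by clamping each coordinate.

```python
from typing import Tuple

Point = Tuple[float, ...]


def relaxed_realization(neutral: Point, lower: Point, upper: Point) -> Point:
    """Closest point to `neutral` inside the box [lower, upper].

    Clamping each coordinate independently gives the Euclidean
    projection onto an axis-aligned box, i.e. the smallest departure
    from the neutral configuration that still reaches the region.
    """
    return tuple(min(max(n, lo), hi)
                 for n, lo, hi in zip(neutral, lower, upper))


# Hypothetical values: (pitch in semitones above baseline, time in ms).
lower, upper = (4.0, 100.0), (7.0, 180.0)

# Neutral pitch is too low: only the pitch coordinate is moved,
# and only as far as the nearest edge of the region.
print(relaxed_realization((2.0, 150.0), lower, upper))

# A neutral configuration already inside the region is left unchanged.
print(relaxed_realization((5.0, 120.0), lower, upper))
```

In this reading, the realization stops at the edge of the target
region rather than at its centre, which is one way to picture why
realizations may vary while the perceptual target is still met.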