Modelling prosody for speech synthesis: example from Polish
Dominika Oliver

In this talk I will present the issues encountered in intonation prediction and generation in a text-to-speech system, using Polish as an example.

Prosodic analysis in speech synthesis usually involves modelling several components: segmental duration, division into prosodic phrases, stress and accent placement, accent and boundary tone types, and F0 contour generation. Each of these plays a role in generating natural-sounding speech, an essential but non-trivial task for any text-to-speech system.

The prosody generation implementation discussed here concentrates on two components, accent type and F0 prediction, carried out in two stages using machine learning techniques:
-prediction of accent placement and accent type using classification and regression trees
-prediction/generation of F0 contour using linear regression
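
As an illustration of the first stage, the sketch below shows the core step of a classification tree: choosing the binary split that minimises weighted Gini impurity. It is a minimal, single-split example and not the implementation used in this work. The feature set (lexical stress, distance to phrase end), the labels and the toy data are all hypothetical; a full CART learner would apply the same step recursively and prune the resulting tree.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_impurity) of the best binary split.

    rows: list of numeric feature vectors; labels: parallel list of class labels.
    If no split improves on the unsplit impurity, feature_index is None.
    """
    best = (None, None, gini(labels))
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (f, t, score)
    return best

# Hypothetical toy data, one row per syllable:
# [lexical_stress (0 or 1), syllables_to_phrase_end]
rows = [[1, 5], [1, 4], [0, 3], [1, 1], [0, 2], [0, 0]]
# Hypothetical labels; a real system would use the accent type inventory.
labels = ["accented", "accented", "unaccented",
          "accented", "unaccented", "unaccented"]

feature, threshold, impurity = best_split(rows, labels)
```

On this toy data the lexical stress feature separates the two classes completely, so the chosen split is on feature 0.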

This study is based on the PoInt speech database. The acoustic parameters characterising accent types in Polish were analysed, and the features characteristic of each accent type were derived. In addition, contour types were classified using machine learning techniques, in particular neural networks and hierarchical clustering.
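
To make the clustering step concrete, here is a minimal sketch of agglomerative (bottom-up) hierarchical clustering applied to F0 contours. It assumes that each accent contour has been time-normalised to a fixed number of samples and expressed relative to the speaker's mean; the contour values, the Euclidean distance and the average-linkage criterion are illustrative assumptions, not the settings used in the study.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length contours."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, contours):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(contours[i], contours[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(contours, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    Returns a list of clusters, each a sorted list of contour indices.
    """
    clusters = [[i] for i in range(len(contours))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], contours)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical contours: four F0 samples per accent, in semitones
# relative to the speaker mean.
contours = [
    [-2.0, -1.0, 1.0, 2.0],   # rising shape
    [-1.8, -0.9, 1.1, 2.2],   # rising shape
    [2.0, 1.0, -1.0, -2.0],   # falling shape
    [2.1, 0.8, -1.2, -1.9],   # falling shape
    [-1.9, -1.1, 0.9, 2.1],   # rising shape
]
groups = agglomerate(contours, 2)
```

With two clusters requested, the rising and the falling shapes end up in separate groups. Stopping at a fixed number of clusters is a simplification; the merge distances could equally be inspected to decide how many contour types the data support.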

In the current work, both the prediction of accent placement and accent type and the prediction/generation of the F0 contour have been implemented in the Festival TTS system. The accent types classified by the ML methods serve as input to Festival's prosody prediction/generation module, using language-specific features. The evaluation of the classification process, symbolic results from prosody prediction and generation, and future work will also be presented.
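
The way predicted accent types can feed a linear regression model of F0 may be sketched as follows. This is a plain-Python least-squares example under stated assumptions, not Festival's code: the two accent classes ("rising", "falling"), the position feature, and the F0 values in Hz are all hypothetical, and the accent type is one-hot encoded with "no accent" as the baseline.

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    Solved with Gauss-Jordan elimination and partial pivoting.
    Assumes X^T X is non-singular. Returns [intercept, w1, w2, ...].
    """
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [v - factor * u for v, u in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

def predict(w, x):
    """Predicted F0 for one feature vector."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

# Hypothetical training data, one row per syllable:
# [is_rising_accent, is_falling_accent, relative_position_in_phrase]
X = [
    [0, 0, 0.0], [1, 0, 0.2], [0, 0, 0.4], [0, 1, 0.6],
    [1, 0, 0.8], [0, 1, 1.0], [0, 0, 0.5],
]
# Hypothetical F0 targets in Hz.
y = [120.0, 146.0, 112.0, 123.0, 134.0, 115.0, 110.0]

weights = fit_linear(X, y)
# F0 for a syllable with a rising accent in mid-phrase.
f0_mid_phrase_rise = predict(weights, [1, 0, 0.5])
```

The toy targets were constructed to follow an exact linear rule (a baseline, a boost per accent type, and a declination term over the phrase), so the fitted weights recover that rule. Real F0 data would leave a residual, and more features would normally be needed.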