Automating the Limited Domain Synthesis Voice Creation

Building and troubleshooting your own 'Limited Domain Voice' in less than an hour


IGK 2004 Project


Proposers: Dominika Oliver, Irene Cramer
Other interested students: Sasha Calhoun, Michael Kruppa, Hannele Nicholson
Suggested Lecturers/Guests: CSTR Festival group
Time constraints:

Description

The aim of this project is to streamline the creation of limited domain synthetic voices and to identify existing problems in building them. The task is therefore twofold. First, we will build a limited domain speech synthesiser based on the voice of one of the participants, using the Festival speech synthesis toolkit (Black et al., 1998) with the Festvox voice-building tools. Second, we will construct a user interface with the aim of fully automating the voice creation process.

By limited domain speech synthesis we mean applications whose speech output is constrained in the sense that they target a specific vocabulary and set of phrases. Common examples are systems that tell the time or read out telephone numbers, sports results or fixed weather reports. Such systems are popular because they are a small, easy-to-build version of unit selection synthesis in which the number of units is controlled. Because of this control, they achieve the good quality of unit selection while avoiding the pitfalls of more open-domain, general unit selection systems (Black and Lenzo, 2000).

The creation of a limited domain synthetic voice consists of the following tasks: designing the prompts (sentences), customising the synthesiser front end, recording and autolabelling the prompts, building utterance structures for the recorded utterances, extracting pitchmarks and building LPC coefficients, building a cluster unit selection synthesiser from the utterances, and finally testing and tuning.
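The task sequence above can be sketched as an ordered pipeline that stops at the first failing step, which is the behaviour an automating interface would need in order to point the user at the stage that went wrong. This is only an illustrative sketch: the stage names are taken from the task list, while `run_pipeline`, `make_stub` and the stub actions are hypothetical placeholders and do not correspond to actual Festival or Festvox commands.

```python
from typing import Callable, List, Tuple

# A stage is a (name, action) pair; the action returns True on success.
Stage = Tuple[str, Callable[[], bool]]

# Stage names follow the task list in the text.
STAGE_NAMES = [
    "design prompts",
    "customise front end",
    "record prompts",
    "autolabel prompts",
    "build utterance structures",
    "extract pitchmarks and LPC coefficients",
    "build cluster unit selection synthesiser",
    "test and tune",
]


def make_stub(result: bool) -> Callable[[], bool]:
    """Hypothetical stand-in for a real build step (assumption, not Festvox)."""
    return lambda: result


def run_pipeline(stages: List[Stage]) -> List[Tuple[str, str]]:
    """Run stages in order; after the first failure, mark the rest as skipped."""
    report = []
    failed = False
    for name, action in stages:
        if failed:
            report.append((name, "skipped"))
            continue
        try:
            ok = action()
        except Exception:
            # Treat any exception raised by a step as a failure of that step.
            ok = False
        report.append((name, "ok" if ok else "failed"))
        failed = not ok
    return report


if __name__ == "__main__":
    stages = [(name, make_stub(True)) for name in STAGE_NAMES]
    for name, status in run_pipeline(stages):
        print(f"{status:8s} {name}")
```

In a real system each stub would be replaced by a call into the voice-building tools, and the report would be shown in the interface.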

We believe these tasks can be fully automated with the help of a user interface. Within this interface, we plan to carry out a complete voice creation for a specific limited domain. The proposed system will integrate the above steps and include automatic diagnostics that target specific problems such as bad recordings, misaligned automatic labels, inverted waveforms and incorrect pitchmarking.
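As a rough illustration of what two such diagnostics might look like, the sketch below flags clipped recordings and checks whether a recording appears to have the opposite polarity to a reference recording. It is a heuristic under stated assumptions, not the project's method: samples are assumed to be 16-bit signed PCM values already loaded into a list, the function names are hypothetical, and the polarity check relies on the assumption that voiced speech waveforms are asymmetric, so that the sign of the sample skewness flips when a waveform is inverted. It can only report a mismatch against the reference, not which of the two is correct.

```python
from typing import Sequence

FULL_SCALE = 32767  # assumption: 16-bit signed PCM samples


def clipping_ratio(samples: Sequence[int], margin: int = 0) -> float:
    """Fraction of samples whose magnitude reaches full scale minus margin."""
    if not samples:
        return 0.0
    limit = FULL_SCALE - margin
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / len(samples)


def skewness(samples: Sequence[int]) -> float:
    """Population skewness of the samples; 0.0 for empty or constant input."""
    n = len(samples)
    if n == 0:
        return 0.0
    mean = sum(samples) / n
    m2 = sum((s - mean) ** 2 for s in samples) / n
    if m2 == 0:
        return 0.0
    m3 = sum((s - mean) ** 3 for s in samples) / n
    return m3 / m2 ** 1.5


def polarity_differs(reference: Sequence[int], candidate: Sequence[int]) -> bool:
    """True if the two signals have skewness of opposite sign (heuristic)."""
    return skewness(reference) * skewness(candidate) < 0
```

A recording with a high clipping ratio, or one whose polarity differs from the rest of the database, would be offered to the user for re-recording or correction rather than silently fixed.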

As a prerequisite for the project, a prototype system should be built before the meeting. By this we mean an interface and a recorded and pre-processed voice in a couple of languages (e.g. German, English, Polish). The proposed interface is a graphical one in which multimodal communication with the user could be implemented (based on VoiceXML for the speech/dialogue part and Tcl/Tk for the graphical part).

The work during the Edinburgh meeting will be devoted to the analysis and identification of problems faced during the voice/interface creation part. We would like to address the issues listed above by suggesting and possibly implementing adequate solutions. By the end of the week we hope to demonstrate the impact of proper diagnostics and tuning on the resulting quality of the synthesised voice.

References

Black, A., Taylor, P. and Caley, R. (1998) "The Festival speech synthesis system".

Black, A. and Lenzo, K. (2000) "Limited Domain Synthesis", ICSLP2000, Beijing, China.

Larson, J. A. (2002) VoiceXML: Introduction to Developing Speech Applications. Prentice Hall.

http://www.w3.org/Voice/

http://www.voicexml.org/