(advanced course, lecture, 2 SWS, 3 LP/ECTS, LSF 50123)
Wed 16.15-17.45, E1.3 / 0.01 (HS I)
Prerequisites: None, but some background in computational linguistics, speech science, or signal processing will be advantageous.
Written exam at the end of the winter semester (Feb 9 and Mar 16, 2011).
Speech synthesis is an essential component of any system relying on intuitive human-machine communication. Speech synthesis systems are also used in phonetic research to gain further insight into speech production and acoustic properties of speech. This advanced course offers an introduction to text-to-speech (TTS) synthesis systems and strategies. Various approaches to speech synthesis are presented, including formant synthesis, concatenative synthesis, and state-of-the-art corpus-based unit selection synthesis. Linguistic text analysis and natural language processing modules typically included in TTS systems are covered as well.
Date | Topic | Slides | Assignment |
27.10. | Introduction; Synthesis strategies | ppt pdf | Taylor 2009, ch. 1 and 2 |
03.11. | Components of TTS systems | ppt pdf | Taylor 2009, ch. 3.1-3.4 |
10.11. | TTS systems: Mary (Steiner) | | |
17.11. | TTS systems: Festival | ppt pdf | Clark et al. 2007 |
24.11. | Formant synthesis; Articulatory synthesis | ppt pdf | Taylor 2009, ch. 13 |
01.12. | Concatenative synthesis I | ppt pdf | Taylor 2009, ch. 14 |
08.12. | Concatenative synthesis II | | Taylor 2009, ch. 14 |
15.12. | Unit Selection synthesis | ppt pdf | Taylor 2009, ch. 16 |
05.01. | Statistical parametric (HMM) synthesis | ppt pdf | Taylor 2009, ch. 15 |
12.01. | Linguistic text analysis I | ppt pdf | Taylor 2009, sec. 7.4, 8.3-8.5; Möbius 2001, ch. 3-6 |
19.01. | Linguistic text analysis II | | |
26.01. | Prosody: Duration and intonation modeling | ppt pdf | Taylor 2009, ch. 9 |
02.02. | Questions and answers | | |
09.02. | End of term written exam (1) | | |
16.03. | End of term written exam (2) | | Start at 10:15 a.m. Please arrive by 10:00 a.m. |
Sample exam questions (last update: Jan 28, 2011)
BibTeX entries for all references (books, papers, URLs):
@Book{Allen/etal:1987,
  author = {Allen, Jonathan and Hunnicutt, M.~Sharon and Klatt, Dennis},
  title = {From Text to Speech: {T}he {MIT}alk System},
  publisher = {Cambridge University Press},
  year = 1987,
  address = {Cambridge},
  annote = {tts, formant synthesis, textbook}
}

@InProceedings{Black/Taylor:1994,
  author = {Black, Alan W. and Taylor, Paul},
  title = {C\textsc{hatr}: a generic speech synthesis system},
  booktitle = {Proceedings of the International Conference on Computational Linguistics (Kyoto, Japan)},
  volume = 2,
  year = 1994,
  pages = {983--986}
}

@Book{Breiman/etal:1984,
  author = {Breiman, Leo and Friedman, Jerome~H. and Olshen, Richard~A. and Stone, Charles~J.},
  title = {Classification and Regression Trees},
  publisher = {Wadsworth \& Brooks},
  year = 1984,
  address = {Pacific Grove, CA}
}

@InCollection{Campbell:1992,
  author = {Campbell, W. Nick},
  title = {Syllable-based segmental duration},
  editor = {Bailly, G{\'e}rard and Beno{\^{\i}}t, Christian and Sawallis, Thomas R.},
  booktitle = {Talking Machines: Theories, Models, and Designs},
  publisher = {Elsevier},
  year = 1992,
  address = {Amsterdam},
  pages = {211--224}
}

@Article{Campbell:1999,
  author = {Campbell, W. Nick},
  title = {A call for generic-use large-scale single-speaker speech corpora and an example of their application in concatenative speech synthesis},
  journal = {Technical Publications, ATR Interpreting Telecommunications Research Laboratories},
  year = 1999,
  pages = {42--47},
  annote = {unit selection}
}

@Article{Carlson/Granstrom:1991,
  author = {Carlson, Rolf and Granstr{\"o}m, Bj{\"o}rn},
  title = {Speech synthesis development and phonetic research---a personal introduction},
  journal = {Journal of Phonetics},
  year = 1991,
  volume = 19,
  pages = {3--8},
  annote = {synthesis}
}

@Book{Clark/Yallop:1995,
  author = {Clark, John and Yallop, Colin},
  title = {An Introduction to Phonetics and Phonology},
  publisher = {Blackwell},
  year = 1995,
  address = {Oxford},
  edition = {2nd},
  note = {1st edition 1990}
}

@Book{Clark/etal:2007a,
  author = {Clark, John and Yallop, Colin and Fletcher, Janet},
  title = {An Introduction to Phonetics and Phonology},
  publisher = {Blackwell},
  year = 2007,
  address = {Oxford},
  edition = {3rd},
  annote = {textbook, phonetics}
}

@Article{Clark/etal:2007b,
  author = {Clark, Robert A.~J. and Richmond, Korin and King, Simon},
  title = {Multisyn: Open-domain unit selection for the {Festival} speech synthesis system},
  journal = {Speech Communication},
  year = 2007,
  volume = 49,
  number = 4,
  pages = {317--330},
  annote = {unit selection, synthesis, Festival, voice building, overview}
}

@Article{Dudley:1939a,
  author = {Dudley, Homer},
  title = {The vocoder},
  journal = {Bell Labs Record},
  year = 1939,
  volume = 17,
  pages = {122--126}
}

@Book{Dutoit:1997,
  author = {Dutoit, Thierry},
  title = {An Introduction to Text-to-Speech Synthesis},
  publisher = {Kluwer},
  year = 1997,
  address = {Dordrecht},
  annote = {textbook, synthesis, tts; Review by Eileen Fitzpatrick in CL 24 (2), 1998, 322--323}
}

@Article{Fant:1953,
  author = {Fant, Gunnar},
  title = {Speech communication research},
  journal = {Ing. Vetenskaps Akad. Stockholm},
  year = 1953,
  volume = 24,
  pages = {331--337},
  annote = {formant synthesis, OVE I}
}

@Book{Fant:1960,
  author = {Fant, Gunnar},
  title = {Acoustic Theory of Speech Production},
  publisher = {Mouton},
  year = 1960,
  address = {The Hague}
}

@InCollection{Fujisaki:1983,
  author = {Fujisaki, Hiroya},
  title = {Dynamic characteristics of voice fundamental frequency in speech and singing},
  booktitle = {The Production of Speech},
  publisher = {Springer},
  year = 1983,
  editor = {MacNeilage, Peter F.},
  address = {New York},
  pages = {39--55}
}

@Article{Fujisaki:1987,
  author = {Fujisaki, Hiroya},
  title = {A note on the physiological and physical basis for the phrase and accent components in the voice fundamental frequency contours},
  journal = {Annual Bulletin of the Research Institute for Logopedics and Phoniatrics (Tokyo)},
  year = 1987,
  volume = 21,
  pages = {165--175}
}

@Article{Fujisaki/etal:1979b,
  author = {Fujisaki, Hiroya and Hirose, Keikichi and Ohta, K.},
  title = {Acoustic features of the fundamental frequency contours of declarative sentences in {J}apanese},
  journal = {Annual Bulletin of the Research Institute for Logopedics and Phoniatrics (Tokyo)},
  year = 1979,
  volume = 13,
  pages = {163--172}
}

@Article{Holmes:1973,
  author = {Holmes, John N.},
  title = {The influence of glottal waveform on the naturalness of speech from a parallel formant synthesizer},
  journal = {IEEE Transactions AU},
  year = 1973,
  volume = 21,
  pages = {298--305}
}

@InProceedings{Hunt/Black:1996,
  author = {Hunt, Andrew J. and Black, Alan W.},
  title = {Unit selection in a concatenative speech synthesis system using a large speech database},
  booktitle = {Proceedings of the {IEEE} International Conference on Acoustics and Speech Signal Processing (M{\"u}nchen, Germany)},
  year = 1996,
  volume = 1,
  pages = {373--376},
  annote = {unit selection}
}

@InProceedings{Iwahashi/Sagisaka:1993,
  author = {Iwahashi, Naoto and Sagisaka, Yoshinori},
  title = {Duration modelling with multiple split regression},
  booktitle = {Proceedings of the European Conference on Speech Communication and Technology (Berlin, Germany)},
  year = 1993,
  volume = {??},
  pages = {329--332},
  annote = {tts, prosody, duration}
}

@Article{Kaplan/Kay:1994,
  author = {Kaplan, Ronald and Kay, Martin},
  title = {Regular models of phonological rule systems},
  journal = {Computational Linguistics},
  year = 1994,
  volume = 20,
  pages = {331--378}
}

@Book{Kempelen:1791,
  author = {Kempelen, Wolfgang von},
  title = {{Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine}},
  publisher = {J. V. Degen},
  year = 1791,
  address = {Wien},
  note = {Facsimile Neudruck, 1970, der Ausgabe Wien 1791 mit einer Einleitung von Herbert E. Brekle und Wolfgang Wildgen. Friedrich Frommann, Stuttgart}
}

@Article{Klatt:1980a,
  author = {Klatt, Dennis H.},
  title = {Software for a cascade/parallel formant synthesizer},
  journal = {Journal of the Acoustical Society of America},
  year = 1980,
  volume = 67,
  number = 3,
  pages = {971--980},
  annote = {synthesis, Klatt80 synthesizer}
}

@Article{Kratzenstein:1782,
  author = {Kratzenstein, Christian Gottlieb},
  title = {Sur la naissance de la formation des voyelles},
  journal = {Journal de Physique},
  year = 1782,
  volume = 21,
  pages = {358--380},
  note = {French translation of: Tentamen coronatum de voce, Acta Acad. Petrog., 1780}
}

@Book{Leeuwen:1990,
  editor = {van Leeuwen, J.},
  title = {Handbook of Theoretical Computer Science},
  publisher = {Elsevier, Amsterdam; MIT Press, Cambridge, MA},
  year = 1990,
  volume = {B},
  annote = {FST, fsm; ISBN 0 444 88074 7}
}

@Book{Mobius:1993a,
  author = {M{\"o}bius, Bernd},
  title = {{Ein quantitatives Modell der deut\-schen Intona\-tion---Analyse und Synthese von Grund\-frequenz\-ver\-l{\"a}ufen}},
  publisher = {Niemeyer},
  year = 1993,
  OPTnumber = 305,
  OPTseries = {Linguistische Arbeiten},
  address = {T{\"u}bingen}
}

@Article{Mobius:1999,
  author = {M{\"o}bius, Bernd},
  title = {{The Bell Labs German text-to-speech system}},
  journal = {Computer Speech and Language},
  year = 1999,
  volume = 13,
  pages = {319--358},
  annote = {tts, german}
}

@Book{Mobius:2001a,
  author = {M{\"o}bius, Bernd},
  title = {German and Multilingual Speech Synthesis},
  publisher = {University of Stuttgart},
  year = 2001,
  series = {Arbeitspapiere des Instituts f{\"u}r Maschinelle Sprachverarbeitung (Univ. Stuttgart), AIMS 7 (4)},
  pages = {1--300},
  annote = {synthesis, textbook, tts}
}

@InProceedings{Mobius/Santen:1996,
  author = {M{\"o}bius, Bernd and van Santen, Jan},
  title = {Modeling segmental duration in {G}erman text-to-speech synthesis},
  booktitle = {Proceedings of the International Conference on Spoken Language Processing (Philadelphia, PA)},
  year = 1996,
  volume = 4,
  pages = {2395--2398}
}

@Article{Mohri:1997,
  author = {Mohri, Mehryar},
  title = {Finite-state transducers in language and speech processing},
  journal = {Computational Linguistics},
  year = 1997,
  volume = 23,
  number = 2,
  pages = {269--311}
}

@InCollection{Mohri/etal:1998,
  author = {Mohri, Mehryar and Pereira, Fernando and Riley, Michael},
  title = {A rational design for a weighted finite-state transducer library},
  booktitle = {Lecture Notes in Computer Science 1436},
  publisher = {Springer},
  year = 1998,
  editor = {Wood, D. and Yu, S.},
  address = {New York},
  pages = {144--158},
  annote = {fsm}
}

@InProceedings{Nikleczy/Olaszy:2003,
  author = {Nikl{\'e}czy, P. and Olaszy, Gabor},
  title = {A reconstruction of {Farkas Kempelen's} speaking machine},
  booktitle = {Proceedings of the European Conference on Speech Communication and Technology (Geneva, Switzerland)},
  year = 2003,
  pages = {2453--2456},
  annote = {Kempelen, speaking machine, synthesis}
}

@Article{Ohman:1967,
  author = {{\"O}hman, Sven E. G.},
  title = {Word and sentence intonation: a quantitative model},
  journal = {Speech Transmission Laboratory---Quarterly Progress and Status Report},
  year = 1967,
  volume = {2--3},
  pages = {20--54}
}

@Article{Ohman/Lindqvist:1966,
  author = {{\"O}hman, Sven E. G. and Lindqvist, Jan},
  title = {Analysis-by-synthesis of prosodic pitch contours},
  journal = {Speech Transmission Laboratory---Quarterly Progress and Status Report},
  year = 1966,
  volume = 4,
  pages = {1--6}
}

@InCollection{Olive/etal:1998,
  author = {Olive, Joseph and van~Santen, Jan and M{\"o}bius, Bernd and Shih, Chilin},
  title = {Synthesis},
  booktitle = {Multilingual Text-to-Speech Synthesis: The {B}ell {L}abs Approach},
  editor = {Sproat, Richard},
  publisher = {Kluwer},
  year = 1998,
  address = {Dordrecht},
  chapter = 7,
  pages = {191--228}
}

@PhdThesis{Pierrehumbert:1980,
  author = {Pierrehumbert, Janet},
  title = {The phonology and phonetics of {E}nglish intonation},
  school = {MIT},
  address = {Cambridge, MA},
  year = 1980
}

@Book{PompinoMarschall:1995,
  author = {Pompino-Marschall, Bernd},
  title = {{Einf{\"u}hrung in die Phonetik}},
  publisher = {de Gruyter},
  year = 1995,
  address = {Berlin}
}

@InCollection{Riley:1992,
  author = {Riley, Michael D.},
  title = {Tree-based modeling for speech synthesis},
  editor = {Bailly, G{\'e}rard and Beno{\^{\i}}t, Christian and Sawallis, Thomas R.},
  booktitle = {Talking Machines: Theories, Models, and Designs},
  publisher = {Elsevier},
  year = 1992,
  address = {Amsterdam},
  pages = {265--273}
}

@Article{Santen:1993b,
  author = {van Santen, Jan P.~H.},
  title = {Exploring \textit{N}-way tables with sums-of-products models},
  journal = {Journal of Mathematical Psychology},
  year = 1993,
  volume = 37,
  number = 3,
  pages = {327--371}
}

@InCollection{Santen:1998a,
  author = {van Santen, Jan P.~H.},
  title = {Timing},
  booktitle = {Multilingual Text-to-Speech Synthesis: The {B}ell {L}abs Approach},
  editor = {Sproat, Richard},
  publisher = {Kluwer},
  year = 1998,
  address = {Dordrecht},
  pages = {115--139},
  annote = {tts, duration}
}

@InCollection{Schweitzer/etal:2006a,
  author = {Schweitzer, Antje and Braunschweiler, Norbert and Dogil, Grzegorz and Klankert, Tanja and M{\"o}bius, Bernd and M{\"o}hler, Gregor and Morais, Edmilson and S{\"a}uberlich, Bettina and Thomae, Matthias},
  title = {Multimodal speech synthesis},
  booktitle = {{SmartKom}: Foundations of Multimodal Dialogue Systems},
  editor = {Wahlster, Wolfgang},
  publisher = {Springer},
  year = 2006,
  pages = {411--435},
  annote = {sk, synthesis, unitsel}
}

@InProceedings{Silverman/etal:1992,
  author = {Silverman, Kim and Beckman, Mary and Pitrelli, John and Ostendorf, Mari and Wightman, Colin and Price, Patti and Pierrehumbert, Janet and Hirschberg, Julia},
  title = {{ToBI: A standard for labelling English prosody}},
  booktitle = {Proceedings of the International Conference on Spoken Language Processing (Banff, Alberta)},
  year = 1992,
  volume = 2,
  pages = {867--870}
}

@TechReport{Sproat:1995a,
  author = {Sproat, Richard},
  title = {{LEXTOOLS}: {T}ools for finite-state linguistic analysis},
  institution = {AT\&T Bell Laboratories},
  year = 1995,
  note = {11522-951108-10TM}
}

@InProceedings{Sproat:1995b,
  author = {Sproat, Richard},
  title = {A finite-state architecture for tokenization and grapheme-to-phoneme conversion in multilingual text analysis},
  booktitle = {{From text to tags---Issues in multilingual language analysis. Proceedings of the ACL SIGDAT Workshop}},
  year = 1995,
  address = {University College, Belfield, Dublin, Ireland},
  pages = {65--72}
}

@Book{Sproat:1998,
  title = {Multilingual Text-to-Speech Synthesis: The {B}ell {L}abs Approach},
  editor = {Sproat, Richard},
  publisher = {Kluwer},
  year = 1998,
  address = {Dordrecht},
  annote = {textbook, tts, synthesis; ISBN 0-7923-8027-4; Review by Douglas O'Shaughnessy in Computational Linguistics 24(4), 1998, 656--658}
}

@Article{Syrdal/etal:1997,
  author = {Syrdal, Ann K. and Conkie, Alistair and Stylianou, Yannis and Schroeter, J{\"u}rgen and Garrison, L.F. and Dutton, D.},
  title = {Voice selection for speech synthesis},
  journal = {Journal of the Acoustical Society of America},
  year = 1997,
  volume = 102,
  number = 5,
  pages = 3191,
  note = {(abstract)},
  annote = {tts}
}

@InProceedings{Syrdal/etal:1998b,
  author = {Syrdal, Ann K. and Conkie, Alistair and Stylianou, Yannis},
  title = {Exploration of acoustic correlates in speaker selection for concatenative synthesis},
  booktitle = {Proceedings of the International Conference on Spoken Language Processing (Sydney, Australia)},
  year = 1998,
  volume = 6,
  pages = {2743--2746},
  annote = {tts, inventory, speaker}
}

@Book{Taylor:2009,
  author = {Taylor, Paul},
  title = {Text-to-Speech Synthesis},
  publisher = {Cambridge University Press},
  year = 2009,
  annote = {textbook, synthesis, tts}
}

@Book{Ungeheuer:1962a,
  author = {Ungeheuer, Gerold},
  title = {{Elemente einer akustischen Theorie der Vokalartikulation}},
  publisher = {Springer},
  year = 1962,
  address = {Berlin}
}

@Article{Wheatstone:1838,
  author = {Wheatstone, Charles},
  title = {Art. II. -- 1. On the vowel sounds, and on Reed Organ Pipes. By Robert Willis [...] 2. Le M{\'e}chanisme de la Parole, suivi de la Description d'une Machine Parlante. Par M. de Kempelen [...] 3. C.G. Kratzenstein. Tentamen Coronatum de Voce. [...]},
  journal = {The London and Westminster Review},
  year = 1838,
  pages = {27ff.},
  annote = {Review of von Kempelen, Kratzenstein, Willis}
}