Temporal Constraints on the Production of Frequent and Infrequent Syllables
Antje Schweitzer and Bernd Möbius
Institute of Natural Language Processing, University of Stuttgart, Germany

According to Guenther and Perkell's speech production model [JPhon 28(3),2000], there is a unique phonetic target region in auditory-temporal space for each phoneme of a given language [Guenther et al., Psychological Review 105, 1998]. We have extended this model to the prosodic domain. Moreover, we propose to integrate it with an exemplar-theoretical view by positing that accumulations of exemplars implicitly define corresponding regions in perceptual space that serve as targets in the production of prosody [Schweitzer&Moebius, Proc.ICPhS-2003, to appear]. Thus, the speaker has access to stored representations of prosodic events, including their tonal and temporal structure, that serve as a reference in speech production. In the temporal dimension, the realization of a unit (i.e., segment or syllable) relative to the temporal target region can be modeled using z-scores of unit durations. Z-scores are unit durations normalized by unit-specific mean duration and standard deviation. Mean duration and standard deviation represent the extension of the stored representations in the temporal dimension. The z-score thus indicates the deviation of a particular realization from other realizations. In other words, the z-score indicates the lengthening or shortening of a unit compared to other realizations of this unit. An exemplar-theoretic approach is compatible with the existence of a mental syllabary in which realizations of the most frequent syllables are stored. Syllables assumed to be stored in the syllabary exhibit more coarticulation than rare syllables [Whiteside&Varley, Proc.ICSLP-1998], which are assembled on-line from smaller units.
Analogously, we expected to find less variation for infrequent syllables than for frequent syllables in the temporal domain, when comparing syllable z-scores to the mean segment z-scores of the involved segments. This hypothesis is motivated by the assumption that, since the speaker has no access to syllable-specific mean duration and standard deviation for infrequent syllables, lengthening or shortening is planned for each segment relative to the respective segment mean and standard deviation. For frequent syllables, on the other hand, enough representations are available to plan lengthening or shortening relative to the mean and standard deviation of the stored representations of this syllable. We examined syllables in a corpus of 160 minutes of speech from a professional male speaker recorded as a database for unit selection speech synthesis. This corpus had been designed to maximize coverage of phoneme-phoneme combinations, and therefore exhibits an unusual syllable frequency distribution with disproportionately many instances of some otherwise infrequent syllables. The frequency classification of the syllables was based on probabilistic syllable classes induced from multivariate clustering [Mueller,Proc.ACL-2000], which allows estimation of the theoretical probability even for unseen syllables. From all syllable types which occurred at least 20 times in our database, we chose the 20 most infrequent types and the 153 most frequent types, corresponding to estimated probabilities of less than 0.00005 and more than 0.001, respectively. Some of the most frequent syllables obviously came from frequent function words which usually do not carry pitch accents. To avoid overrating the effects of typical prosodic contexts, we explicitly excluded such syllables, only taking types into account for which we had at least one instance carrying a pitch accent in the database. This left 9721 realizations of 96 frequent types and 450 realizations of 15 infrequent types. 
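As an illustration of the two quantities being compared, the following minimal sketch computes a syllable z-score and the mean segment z-score for one realization. The toy durations, the unit labels, the function names, and the use of the sample standard deviation are all assumptions made for this example; they are not taken from the corpus or from the original analysis.

```python
from statistics import mean, stdev


def z_score(duration, unit_mean, unit_sd):
    """Duration normalized by the unit-specific mean and standard deviation."""
    return (duration - unit_mean) / unit_sd


def unit_stats(durations_by_unit):
    """Map each unit label to (mean, sample standard deviation) of its durations."""
    return {unit: (mean(d), stdev(d)) for unit, d in durations_by_unit.items()}


# Hypothetical toy durations in milliseconds (not corpus data).
segment_durations = {
    "t": [50.0, 60.0, 70.0],
    "a": [90.0, 100.0, 110.0],
}
syllable_durations = {"ta": [140.0, 160.0, 180.0]}

seg_stats = unit_stats(segment_durations)
syl_stats = unit_stats(syllable_durations)

# One hypothetical realization of the syllable "ta" as (segment, duration) pairs.
realization = [("t", 70.0), ("a", 110.0)]

# Syllable z-score: total duration relative to the syllable-specific statistics.
syllable_z = z_score(sum(d for _, d in realization), *syl_stats["ta"])

# Mean segment z-score: each segment relative to its own segment statistics.
mean_segment_z = mean(z_score(d, *seg_stats[s]) for s, d in realization)
```

Under the hypothesis above, the two values should track each other more closely for infrequent syllables, where lengthening is assumed to be planned segment by segment.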
We calculated linear regression models for both sets separately, using the syllable z-score as the predictor variable and the mean segment z-score for the syllable as the predicted variable. The residual standard errors were 0.46 and 0.40 for frequent and infrequent syllables, respectively, indicating stronger variation for the frequent syllables. The Bartlett test confirmed that the difference in variation for the residuals was significant with p < 0.0001, supporting our hypothesis.
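The analysis steps could be reproduced along the following lines. This is a sketch under stated assumptions, not the original analysis code: it fits a simple least-squares line per set, computes the residual standard error, and applies a two-group Bartlett test to the residuals. The function names and the toy data pairs are hypothetical, and the p-value uses the closed form of the chi-square survival function for one degree of freedom.

```python
import math


def ols_residuals(x, y):
    """Fit y = a + b*x by least squares and return the residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def residual_standard_error(residuals):
    """sqrt(RSS / (n - 2)) for a model with an intercept and one slope."""
    return math.sqrt(sum(r * r for r in residuals) / (len(residuals) - 2))


def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def bartlett_two_groups(g1, g2):
    """Bartlett's test for equal variances in two groups; returns (statistic, p)."""
    n1, n2 = len(g1), len(g2)
    v1, v2 = sample_variance(g1), sample_variance(g2)
    dof = n1 + n2 - 2
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / dof
    numerator = (dof * math.log(pooled)
                 - (n1 - 1) * math.log(v1)
                 - (n2 - 1) * math.log(v2))
    correction = 1 + (1 / (n1 - 1) + 1 / (n2 - 1) - 1 / dof) / 3
    stat = max(numerator / correction, 0.0)  # guard against rounding below zero
    # Chi-square survival function with 1 degree of freedom.
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p


# Hypothetical toy data: (syllable z-score, mean segment z-score) pairs.
frequent = [(-1.0, -0.2), (-0.5, -0.9), (0.0, 0.6),
            (0.5, 0.1), (1.0, 1.4), (1.5, 0.8)]
infrequent = [(-1.0, -0.9), (-0.5, -0.4), (0.0, 0.1),
              (0.5, 0.4), (1.0, 1.1), (1.5, 1.3)]

res_freq = ols_residuals([x for x, _ in frequent], [y for _, y in frequent])
res_infreq = ols_residuals([x for x, _ in infrequent], [y for _, y in infrequent])

rse_freq = residual_standard_error(res_freq)
rse_infreq = residual_standard_error(res_infreq)
stat, p = bartlett_two_groups(res_freq, res_infreq)
```

With real data, each list would hold one pair per syllable realization, and a statistics package would normally replace the hand-written test.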