Computational Linguistics & Phonetics, Fachrichtung 4.7, Universität des Saarlandes
Cross-language and individual differences in the production and perception of syllabic prominence. Rhythm-typology revisited.

William J. Barry, Bistra Andreeva, Sibylle Kötzer; Universität des Saarlandes

Project summary

The primary project plan (cf. Proposal BA 737/10-1, dated 28.10.2005) is to examine, in both production and perception: (a) how different languages exploit the universal (= psycho-acoustically determined) means of modifying the prominence of words in an utterance; (b) whether the different word-phonological requirements of a language affect the degree to which these properties are exploited; and (c) whether differences between languages are greater than the differences between speakers of a language, and what the implications are for both traditional isochrony-based and more recent structure-based rhythm-typology groupings (years 1 and 2). We are NOT investigating "word stress/word accent" but rather the change in a given word as a result of making it more or less informationally prominent in the utterance. A second goal (year 3) takes as its point of departure the hypothesis that "rhythm" is the holistic impression resulting from the processing of a given utterance embedded in a given communicative frame. Using strictly controlled iterative speech, a first attempt was made to address the dependency of "rhythmic expectation" (as an aid to lexical access) on the sum of the prosodic information preceding the target.


Languages and Speakers

The languages investigated in the project are assumed to belong to different "rhythm types" and also differ in basic phonological properties: variable vs. fixed word-stress (or lacking word stress); short- vs. long-vowel distinction; variable vs. simple syllable complexity. The following languages were recorded: English and German as NW European "stress-timed" languages; Russian and Bulgarian to complement these as "stress-timed" Slavonic languages with different vowel-reduction patterns; French as a clear "syllable-timed" candidate, since it has no lexical stress; Japanese as the representative of "mora-timing" and Norwegian as a language which is not so readily categorized.

Six regionally homogeneous speakers (3 male and 3 female) per language were recorded: for English, speakers of Southern British English; for German, speakers from the Saarland area who spoke Standard German; for French, speakers of northern standard French; for Bulgarian, speakers of Sofia-Bulgarian; for Russian, speakers of Standard Russian from the Moscow area. The Norwegian informants were all speakers of Urban East Norwegian, and the Japanese informants were all speakers of Tokyo-Japanese. Regional homogeneity was intended to increase the chance of finding a group hierarchy in the exploitation of the acoustic dimensions (i.e. the regional sub-stratum which could have influenced the establishment of the speakers' prominence-giving mechanisms was held constant). All informants were tertiary-educated.



(1) Non-poetic Speech Corpus

In order to provide a basis for the direct comparison of parameter values across different conditions of phrasal accentuation, controlled utterances with the canonical word order of each language were required which could be produced with de-accented and accented variants of the same words. We believe that a laboratory corpus made up of several "artificial" utterances created specifically for the task is more reliable, since it permits the isolation of the variables under study as well as the neutralisation of other factors. Short sentences were constructed containing two one- or two-syllable "critical words" (CWs), one early (but not initial) and one late (but not final) in the sentence. For each sentence, questions were devised to elicit five responses: (a) a broad-focus response; (b) a non-contrastive narrow focus on the early CW; (c) a non-contrastive narrow focus on the late CW; (d) a contrastive focus on the early CW; and (e) a contrastive focus on the late CW.

An example set of one English test sentence plus five different context questions is given below; focus constituents are indicated by square brackets; the nuclear accented word is underlined.

(a) broad focus

A: What was the problem?
B: [The party from Carlton came.]

(b) early narrow non-contrastive focus

A: What was it from Carlton that came?
B: The [party] from Carlton came.

(c) late narrow non-contrastive focus

A: Which party came?
B: The party from [Carlton] came.

(d) early narrow contrastive focus

A: The partners from Carlton came?
B: The [party] from Carlton came.

(e) late narrow contrastive focus

A: The party from Barkham came?
B: The party from [Carlton] came.

There are potentially three degrees of prominence on the critical words: (a) deaccented, (b) pre-nuclear accented and (c) nuclear accented.

(2) Poetic Speech Corpus

In order to investigate the relationship between the concrete rhythmic measure and the rhythmic impression of the utterance that produced it, we recorded verses with an iambic/trochaic, dactylic/anapaest or a more complex metre. These were children's rhymes or specially constructed verses for English, German and Bulgarian. This material was considered to bridge the gap between rhythmically defined verse and normal prose.


In a sound-treated studio, the speakers produced 6 repetitions of each sentence from the non-poetic material, together with its 2 dada versions, in response to the recorded questions; the sentences were presented in a PPT presentation. To provide a (potential) basis for comparing the parameter modification across sentences independently of the different segmental structuring of the critical words (and thus, if possible, to derive a speaker- and/or language-specific quantification of the accent-dependent modification), a reiterative "dada" version of each realisation was produced immediately after the normal-text response. This was produced in two stages: (i) a da or dada replacement of the two CWs and (ii) a dada replacement of the whole sentence. Additionally, four of the Bulgarian and four of the German speakers were recorded producing the English sentence set.

Two speakers (1 male, 1 female) per language (English, German and Bulgarian) produced the rhythmic speech (poetic material) in the form of simple rhymes in a more regular "doggerel" style rather than in an irregular "dramatic" style. In order to have rhythmically defined material without the variation in structural complexity, the speakers also produced all verses in an iterative dada form, repeating meaningful sections of the verses immediately after reciting the textual version.

The recordings were made using an AKG C420IIIPP headset on a Tascam DA-P1 DAT recorder and transferred digitally via the optical channel to a PC using the Kay Elemetrics MultiSpeech speech-signal processing program.

The sentence corpus consists of a total of 22680 sentences (7 languages x 6 speakers x 6 sentences x 5 focus conditions x 6 repetitions x 3 versions). The additional L2 productions amount to 4320 sentences (8 speakers x 6 sentences x 5 focus conditions x 6 repetitions x 3 versions).
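The corpus totals follow directly from multiplying the design factors. The short Python sketch below simply re-derives the two figures; the dictionary keys and the function name are illustrative labels, not names used in the project.

```python
from math import prod

# Design factors exactly as listed in the text (labels are illustrative).
L1_DESIGN = {
    "languages": 7,
    "speakers": 6,
    "sentences": 6,
    "focus_conditions": 5,
    "repetitions": 6,
    "versions": 3,  # normal text plus the two "dada" versions
}
L2_DESIGN = {
    "speakers": 8,  # four Bulgarian and four German speakers producing English
    "sentences": 6,
    "focus_conditions": 5,
    "repetitions": 6,
    "versions": 3,
}


def corpus_size(design):
    """Number of recorded sentences in a fully crossed design."""
    return prod(design.values())


l1_total = corpus_size(L1_DESIGN)
l2_total = corpus_size(L2_DESIGN)
print(l1_total, l2_total)  # 22680 4320
```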

Segmentation, labelling with SAMPA and further processing were done using the Kiel XASSP speech signal analysis package. Six labelling assistants were allocated different sentences (to maximize labelling consistency across conditions within each sentence), and segmentation problems were regularly discussed and decided with the authors at group level.

Analysis methods

The four acoustic dimensions were calculated using Praat scripts and operationalized as follows:

(a) Durations were calculated for all feet in the sentences, for the CWs and their component syllables as well as the syllables of the feet to which the CWs belonged. Furthermore, the durations of the phonetic sound-segments comprising the syllables were calculated. All durational measurements were normalised as a percentage of the mean duration of the corresponding unit in the sentence.

(b) Since comparisons focus on changes in identical words across conditions, F0 was calculated as the mean fundamental frequency (Hz) across the syllable nucleus (vowel or syllabic sonorant) of the lexically stressed syllable of CWs and in the unstressed syllables preceding and following it. The average F0 across the utterance was subtracted to normalize the F0 values. The absolute temporal distances from the F0 peak to syllable onset and to rhyme onset were calculated. Because varying segmental durations can affect peak delay, these absolute measures were converted to relative measures, expressed as a proportion of syllable and rhyme duration. Additionally, the maximum F0 value of the pitch target was measured in semitones (using a 70 Hz reference frequency).

(c) Intensity was measured in two ways: first, as the mean intensity (in dB) of the stressed vowel in the CW, and second, as the spectral balance in that vowel. This was computed as the energy difference between the frequency band from 70-1000 Hz and that from 1200-5000 Hz. This measure, too, was normalized by subtracting the spectral balance across the whole utterance.

(d) Spectral definition was captured with the mean frequency (and bandwidth) values for formants 1-3 of the syllabic nuclei in the lexically stressed syllable of CWs. Change as a function of accentual condition was expressed as percent difference from the broad-focus realisation in each formant separately for the vowel of the lexically stressed syllable in each CW.
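The normalisations described in (a), (b) and (d) can be illustrated with a minimal Python sketch. It is not the project's Praat code: the function names and the toy values are hypothetical, and the duration normalisation assumes that "corresponding unit" means all units of the same type (e.g. all syllables) in the same sentence. The semitone conversion uses the standard formula 12 * log2(f / reference) with the 70 Hz reference given in the text.

```python
import math
from statistics import mean


def normalise_durations(durations_ms):
    """Durations as a percentage of the mean duration of the same unit type
    in the sentence (assumed reading of the normalisation in (a))."""
    sentence_mean = mean(durations_ms)
    return [100.0 * d / sentence_mean for d in durations_ms]


def normalise_f0(nucleus_mean_f0_hz, utterance_mean_f0_hz):
    """Mean F0 of a syllable nucleus minus the utterance mean, in Hz."""
    return nucleus_mean_f0_hz - utterance_mean_f0_hz


def hz_to_semitones(f0_hz, reference_hz=70.0):
    """F0 in semitones relative to the reference frequency."""
    return 12.0 * math.log2(f0_hz / reference_hz)


def relative_peak_delay(peak_time_s, unit_onset_s, unit_duration_s):
    """F0 peak delay as a proportion of the syllable (or rhyme) duration."""
    return (peak_time_s - unit_onset_s) / unit_duration_s


def percent_difference(value, broad_focus_value):
    """Change relative to the broad-focus realisation, in percent."""
    return 100.0 * (value - broad_focus_value) / broad_focus_value


# Toy values, invented for illustration only.
print(normalise_durations([100.0, 200.0, 300.0]))  # [50.0, 100.0, 150.0]
print(hz_to_semitones(140.0))                      # 12.0 (one octave above 70 Hz)
print(percent_difference(550.0, 500.0))            # 10.0
```

Expressing the peak delay as a proportion keeps the measure comparable across syllables whose segmental make-up, and hence duration, differs.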

Rhythm measures were calculated according to Ramus (1999), Ramus et al. (1999) and Grabe & Low (2002). With the exception of %V, which is simply the proportion to which an utterance is vocalic and hence a reflection of overall syllable complexity, the measures all address the variability of the vocalic and consonantal interval durations within a stretch of speech (ips). However, they differ in the type of variability they capture.

(e) Ramus' Δ values (ΔV, ΔC) are the standard deviations of the vocalic or consonantal intervals, i.e., a global variability measure which reflects nothing of the sequential patterning of durations which might logically be seen as underlying any auditory impression of rhythm. G&L's PVI measure (Pair-wise Variability Index) does take the sequential variability into consideration by averaging the durational difference between consecutive vocalic or consonantal intervals.

(f) A normalised version of the PVI formula, used for PVI-V calculations and devised to correct for tempo fluctuations, relates the difference between consecutive intervals to the mean duration of the two intervals.

(g) A number of other measures were calculated which were considered to shed light on the rhythm-tempo relationship: (i) the ratio of the number of consonants to the number of vowels, as a rough measure of syllable complexity in an ips; and (ii) the ratio of vowel duration to consonant duration, as a measure of the temporal structuring of the syllable.
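The rhythm measures in (e) to (g) are simple functions of the interval durations in an ips, as the following Python sketch shows. The function names and the toy durations are hypothetical; the standard-deviation measure is computed here as a population standard deviation, which is an assumption, since the text does not say whether the population or the sample formula was used. The normalised PVI follows the usual Grabe & Low formulation, scaled by 100.

```python
from statistics import mean, pstdev


def percent_v(vocalic, consonantal):
    """%V: share of the total duration that is vocalic, in percent."""
    return 100.0 * sum(vocalic) / (sum(vocalic) + sum(consonantal))


def delta(intervals):
    """Global variability: standard deviation of the interval durations
    (population formula assumed)."""
    return pstdev(intervals)


def raw_pvi(intervals):
    """Mean absolute difference between consecutive intervals."""
    return mean(abs(a - b) for a, b in zip(intervals, intervals[1:]))


def normalised_pvi(intervals):
    """As raw_pvi, but each difference is divided by the mean of the pair,
    which corrects for tempo fluctuations; scaled by 100."""
    return 100.0 * mean(
        abs(a - b) / ((a + b) / 2.0) for a, b in zip(intervals, intervals[1:])
    )


def duration_ratio(vocalic, consonantal):
    """Vowel duration / consonant duration, as in (g)."""
    return sum(vocalic) / sum(consonantal)


# Toy interval durations in ms, invented for illustration only.
vocalic = [100.0, 50.0, 100.0, 50.0]
consonantal = [80.0, 80.0, 80.0]

print(round(percent_v(vocalic, consonantal), 2))  # 55.56
print(delta(vocalic))                             # 25.0
print(raw_pvi(vocalic))                           # 50.0
print(round(normalised_pvi(vocalic), 2))          # 66.67
print(raw_pvi(consonantal))                       # 0.0
```

Reordering the toy vocalic intervals as 100, 100, 50, 50 leaves the standard deviation at 25.0 but lowers the raw PVI, which illustrates the point made in (e): only the PVI is sensitive to the sequential patterning of durations.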

Several analysis methods have been applied to the acoustic parameters a-d. For the sake of optimal comparability across languages, most of the analyses were carried out on the reiterant "dada" utterances, although the results were always verified for the text (replies to eliciting questions) as well:

(a) As a first quantification of the relative importance of the parameter values in the different languages, the change in the parameter values of the CWs between the broad and the early/late narrow focus conditions was computed in percentage points. These analyses showed clear differences across the investigated languages, but did not evaluate their statistical significance.

(b) The data were therefore also analyzed statistically, using multivariate analyses of variance (MANOVA). Clear main effects were found for the factor Language, but there were also many significant interactions between the factors Language and Focus. The interactions present a partial answer to the main question addressed in this project, namely whether languages differ in their use of acoustic parameters to signal functional differences depending on their structural properties. But the contribution of the MANOVAs is modest when it comes to finding differences between the languages in terms of the relative importance of the acoustic cues for accentuation.

(c) Stepwise discriminant analyses were therefore carried out. These select the acoustic parameters in the order in which they contribute to explain the total variance in the data for a given language. Differences between the languages in the order of inclusion of parameters in the discriminant model thus give us information about their relative importance. The interpretation of the results can be problematical when the parameters which are analyzed are correlated, as is often the case for the acoustic cues in our data. In this case, one parameter may be favoured over another and included in the model, even if the second, correlated parameter explains nearly as much of the total variance.

(d) To avoid such artefacts caused by the correlations among the acoustic parameters, the eta coefficients from multivariate analyses of variance were later used to rank the acoustic parameters according to their importance for the distinction between accented and deaccented CWs. The eta coefficients are a ratio of the variance explained by an acoustic parameter divided by the total variance (irrespective of the contribution of other parameters).

(e) Finally, the statistical results were complemented by phonetic and phonological analysis of the data. These analyses are an important complement to the statistical analyses, and allow us to extract interpretable phonetic knowledge from the analyses. This is particularly important in the case of fundamental frequency, since this cue reflects not only continuous, phonetic variation, but also discrete, phonological differences between accent types, which make this parameter less amenable to the types of statistical analysis previously used.
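The variance ratio described in (d) can be sketched in Python as the between-group sum of squares divided by the total sum of squares (commonly called eta squared). This is a simplified univariate, one-factor illustration under that assumption, not the multivariate analysis actually run in the project; the function names and all data values are invented.

```python
from statistics import mean


def eta_squared(groups):
    """Variance explained by the grouping factor divided by total variance,
    for one acoustic parameter (one-way, univariate simplification)."""
    values = [v for group in groups.values() for v in group]
    grand_mean = mean(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(group) * (mean(group) - grand_mean) ** 2 for group in groups.values()
    )
    return ss_between / ss_total


def rank_parameters(data):
    """Order parameters by eta squared, most important first. Each parameter
    is evaluated on its own, so correlated parameters do not displace
    one another as they can in a stepwise procedure."""
    scores = {name: eta_squared(groups) for name, groups in data.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy measurements for one hypothetical language, invented for illustration.
toy_data = {
    "duration": {"accented": [110.0, 120.0], "deaccented": [90.0, 100.0]},
    "f0": {"accented": [5.0, 9.0], "deaccented": [4.0, 8.0]},
}

ranking = rank_parameters(toy_data)
for name, score in ranking:
    print(name, round(score, 3))
# duration 0.8
# f0 0.059
```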


Publications:

Barry, W.J. and Andreeva, B. (in preparation). Cross-language and individual differences in the production and perception of syllabic prominence. Journal of the Acoustical Society of America.

Andreeva, B. and Barry, W.J. (in preparation). Phonetic and Phonological Markers of the Information Structure in Various Languages. Journal of Phonetics.

Andreeva, B. (accepted). Focus and Prominence in Bulgarian and Russian. Inter-Language and Inter-Speaker Variation. Formal Approaches to Slavic Linguistics - FASL 19, April 23-25, 2010, University of Maryland, College Park.

Andreeva, B. (accepted). Intonacija i ritǎm v bǎlgarski i ruski. Deseti slavistični četenija, April 22-24, 2010, University of Sofia.

Barry, W.J. and Andreeva, B. (to appear). Losing the trees in the wood? Reflections on the measurement of spoken-language rhythm. In: Michela Russo (ed.) Gli universali prosodici: confronto e ricerche sulla modellizzazione ritmica e sulle tipologie ritmiche (= Biblioteca di Linguistica), Roma: Aracne.

Andreeva, B., Dimitrova, S. and Barry, W.J. (to appear). Prosodic transfer in L2 speech: Evidence from phrasal prominence and rhythm in English, Bulgarian and German, Proc. 13th annual conference of the Bulgarian Society for British Studies - BSBS, November 7 - 9 2008, Sofia University St. Kliment Ohridski, Sofia, Bulgaria.

Trifsik, O. (2010). Perzeption von Äußerungsprominenz im Russischen. Master Thesis.

Mixdorff, H., Andreeva, B. and Koreman, J. (2010). Quantitative Modeling of Norwegian Tonal Accents in Different Focus Conditions, Speech Prosody, 5th International Conference, Chicago, Illinois, May 11-14 2010.

Klinger, A. (2009). Perzeption von Äußerungsprominenz im Deutschen. Master Thesis.

Barry, W.J., Andreeva, B. and Koreman, J. (2009). Do Rhythm Measures Reflect Perceived Rhythm?, Phonetica, Vol. 66, No. 1-2, pp. 78-94.

Koreman, J., Andreeva, B., Barry, W.J., Wim Van Dommelen and Rein-Ove Sikveland (2009). Cross-language differences in the production of phrasal prominence in Norwegian and German, In: Martti Vainio, Reijo Aulanko, and Olli Aaltonen (eds.), Nordic Prosody, Proceedings of the Xth Conference, Helsinki 2008, Frankfurt: Peter Lang, 139-150.

Koreman, J., Andreeva, B. and Barry, W.J. (2008). Accentuation cues in French and German, Proc. Fourth Conference on Speech Prosody, May 6-9 2008, Campinas, Brazil, pp.613-616.

Russo, M. and Barry, W.J. (2008). Isochrony reconsidered. Objectifying relations between rhythm measures and speech tempo, Proc. Fourth Conference on Speech Prosody, May 6-9 2008, Campinas, Brazil, pp. 419-422.

Barry, W.J. and Russo, M. (2008). "Measuring rhythm. A quantified analysis of Southern Italian Dialects' Stress Time Parameters", In: Antonio Pamies et al. (eds.), Experimental Prosody (= Special Issue 2, Language Design. Journal of Theoretical and Experimental Linguistics), 315-322.

Andreeva, B., Barry, W.J. and Steiner, I. (2007). Producing Phrasal Prominence in German, Proc. International Congress of Phonetic Sciences, Saarbruecken (August 2007), pp. 1209-1212.

Barry, W.J., Andreeva, B. and Steiner, I. (2007). The Phonetic Exponency of Phrasal Accentuation in French and German, Proc. Interspeech, Antwerp (August 2007), pp. 1010-1013.

Oral presentations:

Andreeva, B., Barry, W.J. and Koreman, J. (2010). Rhythm-typology revisited. 32. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft (DGfS). AG: Prosodic Typology: State of the Art and Future Prospects, Humboldt-Universität zu Berlin, 23.-26.02.2010.

Andreeva, B. and Barry, W.J. (2009). Cross-language and individual differences in the production and perception of syllabic prominence. Rhythm-typology revisited. 3rd annual meeting of the DFG-Priority Programme 1234 "Phonological and phonetic competence: between grammar, signal processing, and neural activity", March 1-2, 2009, Cologne, Germany.

Barry, W.J. and Andreeva, B. (2008). Perceiving rhythm. Is it language- or listener-dependent?, Workshop on Prosodic Universals, 15 October, Paris, France.

Barry, W.J. and Andreeva, B. (2008). Sprechrhythmus und Sprachtyp; ein kritischer Blick auf Rhythmusmaße. Institute of Phonetics and Digital Speech Processing at the Christian-Albrechts-University at Kiel, May 2008, Germany.

Barry, W.J. (2008). Speech Rhythm and Language Type; a critical look at rhythm measures, Laboratory of Phonetics and Speech Technology, April 2008, Tallinn University of Technology, Estonia.

Russo, M. and Barry, W.J. (2008). The Pairwise Variability Index. Rhythmic Values from Italian Dialects and Language Typological Implications. Séminaire de l'UMR 7023: Structures Formelles du Langage, 31 March 2008, Univ. Paris, France.

Barry, W.J., Andreeva, B. and Koreman, J. (2008). Do Rhythm Measures Reflect Perceived Rhythm?, Workshop on Empirical Approaches to Speech Rhythm 2008, 28th March 2008, UCL, UK.

Andreeva, B. and Barry, W.J. (2007). Cross-language and Individual Differences in the Production of Phrasal Prominence in Bulgarian and Russian, Proc. 7th European Conference on Formal Description of Slavic Languages, FDSL-7, 30 November - 2 December, 2007, Leipzig, Germany.

Barry, W.J. (2007). Prominence across languages - production differences and their implications. 7th annual research meeting of the International Research Training Group Language Technology and Cognitive Systems, June 5-15, Schloss Dagstuhl, Saarland, Germany.

Barry, W.J. and Andreeva, B. (2007). Cross-language and individual differences in the production and perception of syllabic prominence. Second annual meeting of the SPP 1234, October 6 and 7, 2007, MPI Nijmegen, Netherlands.

Barry, W.J. (2006). Rhythm in languages, in speech (technology) and in general. Synonym or just homographic homophones? 6th annual research meeting of the International Research Training Group Language Technology and Cognitive Systems, September 6-15, Edinburgh, UK.