Reference no #737

Created: 2007-12-12 11:30:33

@InProceedings{Koreman_Andreeva:2000,
AUTHOR = {Koreman, Jacques and Andreeva, Bistra},
TITLE = {Phonetic features in ASR: A linguistic solution to acoustic variation?},
YEAR = {2000},
BOOKTITLE = {Proceedings of the 7th Conference on Laboratory Phonology (LabPhon7), June 29 - July 1},
ADDRESS = {Nijmegen, Netherlands},
ABSTRACT = {In most phonological theories, phonemes are considered to be a set (or hierarchy) of (possibly underspecified) phonetic features, which are the minimal number of formal properties needed to distinguish the phonemes in the language system from each other. In most state-of-the-art automatic speech recognition (ASR) systems, however, phonetic features do not play any role. The statistical models for each phone or phoneme are based on a spectral parameterisation of the signals, such as mel-frequency cepstral coefficients (MFCCs) and energy. Three questions are dealt with in this paper: Can we successfully bridge this gap between phonological theory and ASR by using phonetic features in ASR? Which phonetic feature set is most appropriate for ASR? Can we attain the same result by using more complex non-linguistic modelling?
1. PHONETIC FEATURES IN ASR
To bridge the gap between phonologists' formal representation of the phoneme and the almost purely acoustic description of the signal used in ASR systems, we have used phonetic features to create statistical phone models for automatic speech recognition. The phonetic features were derived by means of a neural network from the spectral representation of the signal used in most standard ASR systems (MFCCs + energy). Not only do we find a clear increase in the phoneme identification rate (see section 2 below) [1], but the confusions between phonemes are also much easier to interpret, since phonemes which are confused are usually very similar in terms of the phonetic features they are made up of. This is not the case when acoustic parameters are used to create phoneme models [2].
2. DIFFERENT PHONETIC FEATURE SETS
It is not self-evident which set of phonetic features is most appropriate to describe phonological categories and the processes that operate on them, since the various feature theories have different phonological implications. To evaluate how appropriate the different feature sets are for application in an ASR system, we have used several different feature sets, both articulatory-phonetic (IPA) and phonological (SPE) [3]. We have so far compared the phoneme identification results for both underspecified and fully specified SPE features with those for the set of features used in the IPA to distinguish all phonemes. In addition, the results were compared to those of a standard ASR system using acoustic parameters (MFCCs) directly to create phone models. We found a clear improvement in the phoneme identification rate when phonetic features were used to model the phones, in comparison to directly using acoustic parameters. Underspecified SPE features led to the best performance of all (for multi-language Eurom0 data, without the use of a lexicon or language model):
acoustic parameters: 15.6%
IPA features: 42.6%
SPE features: 36.2%
Underspecified SPE features: 46.1%
In addition to the feature sets reported so far, the phoneme identification results for articulatory features [4] will be reported and their relative merits will be discussed.
3. VARIATION MODELLING VERSUS LINGUISTIC MODELLING
The acoustic-phonetic mapping in a neural network combines two advantages, namely
1) variation modelling: different acoustic realisations of the same phoneme (e.g. allophonic variants) can be discerned by the neural network
2) linguistic modelling: these different realisations are mapped onto more homogeneous, distinctive features
Even if the neural network can reduce the variation in the input parameters for statistical modelling by mapping different acoustic realisations of a phoneme onto phonetic features, the question remains whether the same result can be reached by using a non-linguistic approach. Variation modelling can also be achieved by using more complex acoustic phoneme models (multiple mixtures per state in an HMM), so that we do not necessarily have to make a mapping onto phonetic features to achieve this goal. A comparison of the performance of a standard system which does not use phonetic features with the performance of a system in which phonetic features are used to train the phoneme models shows the merits of using a signal representation derived from phonological theory.}
}

Last modified: Thu October 16 2014 19:11:34