PROSODY 2000 ABSTRACTS

 

This is the final collection of Prosody 2000 abstracts. The abstracts are sorted according to the family name of the first author. If you want to search them according to other criteria, please use the search function of your browser. The participants of the conference will receive printed copies of this text.

 

 

  

Labeling of prosodic domains: Constraints from brain responses

K. Alter, C. Hruska

MPI of Cognitive Neuroscience, Leipzig

We designed two auditory studies to identify the processing of intonational phrases (IPh) and accents in German using event related brain potentials (ERP).

In the first experiment, context-free utterances with 'neutral intonation' were presented. The ERPs demonstrate that syntactically driven IPh boundaries (right edges) correlate with a typical ERP pattern, the so-called closure positive shift (CPS).

In the second experiment context questions establishing a narrow focus on different constituents in the utterances/answers were introduced. Acoustic data reveal the presence of IPh boundaries at positions as predicted by models of syntax-prosody-mapping. In contrast to the first experiment, these IPh boundaries seem to be 'ignored' by the parser.

In the listeners' ERPs, the CPS does not appear at the right edges of syntactically driven IPh boundaries; instead, it immediately follows focused constituents. This suggests that these elements are critical for phrasing. However, the acoustic data do not show indices of IPh boundaries (H%, pause insertion, prefinal lengthening) after (or at) focused constituents.

We will discuss whether the shifted CPS results from different brain reactions or whether the CPS reflects sensitivity to a focus-driven IPh reconstruction. In the latter case, boundary marking in standard models like ToBI would need some extension.

References

Steinhauer, K., Alter, K. & Friederici, A.D. (1999). Brain potentials indicate immediate use of prosodic cues in natural speech processing. Nature Neuroscience, 2 (2), 191-196.

 

Prosody modification algorithm for vocal allophones based on TD-PSOLA technology for concatenation TTS system

A. Babkin

Moscow Lomonosov State University, Philological Faculty, Department of Theoretical and Applied Linguistics

This report concerns the particulars of using TD-PSOLA technology in a Russian female-voice text-to-speech (TTS) system being developed at the Faculty of Philology of Moscow State University. The well-known TD-PSOLA technology describes a theoretical way of changing the pitch and duration of a speech signal. Developing a high-quality TTS system requires additional algorithms and methods. The author describes all steps in the development of a prosody modification algorithm for vocal allophones based on TD-PSOLA technology for a concatenation TTS system, and pays additional attention to ways of increasing the naturalness of synthesized speech.

Summary (short review of the paper)

One approach to building a high-quality TTS system is the concatenative approach. In this case the synthesized speech signal is formed by joining acoustic waveform samples, which are called elements of concatenation. The elements of concatenation are formed from the initial speech samples stored in the database by modifying their prosodic characteristics (such as duration, fundamental frequency and energy) in accordance with the requirements of the natural language processing module.

The theoretical foundation for our methods of forming the prosodic characteristics of the speech signal is the TD-PSOLA approach. The main idea of TD-PSOLA is as follows: the initial allophone is multiplied by a sequence of time windows synchronized with the fundamental frequency. The resulting acoustic segments, shifted relative to each other beforehand, are summed, producing the required modified allophone. To change the duration of the allophone, some acoustic segments are repeated or eliminated.

In the traditional realization of this algorithm, a noticeable increase in the duration of the speech signal causes some identical segments to be repeated many times, and the resulting speech is perceived as unnatural. To make the phonation more natural, this report proposes special algorithms based on random repetition and on introducing changes into the sequence of identical acoustic segments.

In our Russian speech synthesis system the base elements of concatenation are, in the majority of cases, of phonemic size and are thus allophonic realizations of the traditional phonemes. One of the main requirements that substantially increases the quality of the synthesized speech is minimizing distortion of the acoustic characteristics of the transitional parts of the allophone. Within this requirement, the fundamental frequency is modified along the whole length of the initial allophone, whereas the duration of the allophone is altered only in a specially calculated part, called the stationary section. The algorithm for calculating and locating the stationary section is described in the report.

The article describes all blocks of the algorithm for forming the prosodic characteristics of vocal allophones. Particular attention is given to the methods that substantially increase the naturalness of the synthesized speech signal. All the algorithms and methods proposed in the report have passed a special testing program and are implemented as a computer program that forms part of the Russian text-to-speech system being developed at MSU.
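The lengthening step described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the author's implementation: the function names (`hann`, `extract_segments`, `stretch_stationary`, `overlap_add`), the constant pitch period, the test signal and the rule of drawing each added segment at random from the stationary section are all hypothetical choices made for the example.

```python
import math
import random


def hann(n):
    """Symmetric Hann window of length n (zero at both ends)."""
    if n < 2:
        return [1.0] * n
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def extract_segments(signal, marks, period):
    """Cut one two-period, Hann-windowed segment around every pitch mark."""
    window = hann(2 * period + 1)
    segments = []
    for m in marks:
        seg = []
        for k in range(-period, period + 1):
            idx = m + k
            sample = signal[idx] if 0 <= idx < len(signal) else 0.0
            seg.append(sample * window[k + period])
        segments.append(seg)
    return segments


def stretch_stationary(n_segments, start, end, extra, rng):
    """Return segment indices with `extra` segments added inside [start, end).

    Rather than repeating a single segment many times, every added segment
    is drawn at random from the stationary section (an assumed rule). The
    transitional parts outside the section keep their original order.
    """
    added = [rng.randrange(start, end) for _ in range(extra)]
    return sorted(list(range(n_segments)) + added)


def overlap_add(segments, order, period):
    """Sum the chosen segments at synthesis marks spaced one period apart."""
    out = [0.0] * (period * (len(order) + 1) + 1)
    for j, seg_index in enumerate(order):
        centre = period * (j + 1)
        for k, value in enumerate(segments[seg_index]):
            out[centre - period + k] += value
    return out


# Toy input: a sine with a constant pitch period (assumed, for illustration).
period = 80
signal = [math.sin(2.0 * math.pi * n / period) for n in range(period * 12)]
marks = list(range(period, period * 11 + 1, period))  # 11 pitch marks
segments = extract_segments(signal, marks, period)
order = stretch_stationary(len(segments), start=3, end=8, extra=4,
                           rng=random.Random(0))
longer = overlap_add(segments, order, period)
```

Here the pitch is left unchanged because the synthesis marks keep the original spacing; changing the spacing in `overlap_add` would modify the fundamental frequency instead.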

 

 

 

Prosodic models and speech recognition: towards the common ground

A. Batliner, E. Noeth, B. Möbius, G. Möhler

Chair for Pattern Recognition, University of Erlangen-Nuremberg, Martensstr. 3, D-91058 Erlangen, Germany

Institute for Computational Linguistics, University of Stuttgart, Atzenbergstr. 12, D-70174 Stuttgart, Germany

In spite of the claim made by many researchers that prosody is a valuable source of knowledge in automatic speech recognition and understanding (ASU), it has so far not been used to any considerable extent. This might partly be because its role is more important in more elaborate speech, whereas until recently the main emphasis was on dictation systems or on rather simple dialogue systems. In our opinion, a second, maybe even more important point is that mainstream prosodic models are not designed for use in ASU, which in turn means that they are, in their original state, not well suited to this task. ASU needs a functional representation, that is, a genuinely phonological one: units should only be modelled if they denote a clear-cut difference in linguistic meaning. In addition, the prosodic features on which classification is based should be ‘flat’, i.e., close to the surface and not too much influenced by theoretical considerations; further clustering etc. should be left to the classifier.

If we consider two of the most influential prosodic models from this point of view, the ToBI model and the IPO model, then both are too much ‘in between’. Both introduce a special layer of representation that is, on the one hand, not abstract and functional enough, because quite often a unit has no clear-cut functional linguistic counterpart; on the other hand, the units of description are too abstract, too far away from phonetic reality and from the signal itself, and are therefore not the best features for a classifier to use. This statement holds for other prosodic models as well; it is actually corroborated by the recent use of prosodic information for automatic dialogue systems in different research groups. The common ground in all these models and in ASU as well is thus simply, in terms of the ToBI approach, the stars and the percents, i.e., a functional modelling of accent and boundary position, plus, of course, some other phenomena such as questions vs. non-questions. We will demonstrate this position with our own functional annotation in our ‘prosodic model’, which is used in the VERBMOBIL end-to-end ASU system. In addition, we will elaborate on the prospective use of prosody in other domains of ASU, for instance the classification of paralinguistic phenomena such as emotion with the help of prosodic features, and on the constraints that such a use imposes on prosodic modelling.

 

GToBI - a phonological system for the transcription of German intonation

S. Baumann 1, M. Grice 1, R. Benzmüller 2

Universität des Saarlandes, Saarbrücken 1; G Data Software, Bochum 2

GToBI is a set of conventions for labelling the phonological structure of German intonation with the aim of being easy to learn, reliable, and adaptable for different labelling purposes. It was first developed in 1995 by researchers from Saarbrücken, Stuttgart, Munich and Braunschweig with a view to facilitating the exchange of prosodically annotated data.

GToBI can be regarded as a tool for investigating Standard German intonation. It is an adaptation of the English ToBI system (E-ToBI), which has its roots in autosegmental-metrical phonology. GToBI is able to capture distinctions drawn in the auditory literature on German intonation (e.g. von Essen 1956, Moulton 1962, Pheby 1975, Kohler 1977, Fox 1984) as well as in later autosegmental-metrical studies (e.g. Uhmann 1991, Féry 1993, Grabe 1998), and has been applied to spontaneous and read corpora (Grice et al. 1996, Reyelt et al. 1996). It has been recently modified in order to make the system phonetically more transparent and therefore easier to learn, and to incorporate recent advances in intonational phonology.

GToBI consists minimally of three label tiers: tones, break indices, and words. On the tonal tier the perceived pitch contour is transcribed in terms of pitch accents and boundary tones, with diacritics for pitch range modifiers such as downstep ("!") and upstep ("^") placed immediately before the affected tone. The tone and text tiers are related via the lexically stressed syllable of an accented word. This relation is expressed on the tone tier by a star "*" after the tone occurring in the accent (e.g. L+H*). The GToBI tonal inventory comprises two monotonal (H*, L*) and four bitonal pitch accents (L+H*, L*+H, H+L*, H+!H*), and edge tones which are peripheral to minor (intermediate) phrases (L- or H-) and major (intonation) phrases (L% or H%). The intermediate phrase edge tone, or 'phrase accent' (Grice, Ladd & Arvaniti, to appear; Grice & Benzmüller 1998), may occur on postnuclear lexical stresses as well as at the phrase edge. In the new GToBI the location of the phrase accent is explicitly captured.

The automatic upstep rule, which raises the pitch range after an H- phrase accent in the English ToBI and earlier GToBI, has now been dispensed with, and instead upstep is marked with a "^" diacritic. Furthermore, redundant boundary tone symbols have been deleted. The new inventory is as follows: L-% (low fall, formerly L-L%), L-H% (rise to mid), H-% (high level contour, replacing the counter-intuitive sequence H-L%), and H-^H% (high plateau with a final rise, formerly used without the upstep diacritic).

The break index tier and the tonal tier are closely related, since each type of boundary tone corresponds by default to a given break index. In the default case, "3" and "4" coincide with intermediate phrase and intonation phrase boundaries respectively. These default correspondences are not manually transcribed. GToBI distinguishes between two mismatches in tonal and rhythmic structure: "2r" marks a rhythmic break with tonal continuity and "2t" stands for a tonal break with rhythmic continuity. Break indices below intermediate phrase level are not dealt with.
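As a rough illustration of how the inventory and the default tone-to-break correspondences listed in this abstract could be checked mechanically, here is a minimal sketch. The label sets are taken from the abstract; the function names (`classify_tone`, `default_break_index`) and the rule of ignoring "!" and "^" diacritics when a label is not matched literally are assumptions made for the example, not part of GToBI itself.

```python
# Inventory as listed in the abstract.
PITCH_ACCENTS = {"H*", "L*", "L+H*", "L*+H", "H+L*", "H+!H*"}
PHRASE_ACCENTS = {"L-", "H-"}                      # intermediate phrase edges
BOUNDARY_TONES = {"L-%", "L-H%", "H-%", "H-^H%"}   # intonation phrase edges


def classify_tone(label):
    """Return the category of a tone-tier label, or None if it is unknown.

    The label is tried as written first; if that fails, the downstep ("!")
    and upstep ("^") diacritics are stripped and it is tried again
    (an assumed convention for this sketch).
    """
    stripped = label.replace("!", "").replace("^", "")
    for candidate in (label, stripped):
        if candidate in PITCH_ACCENTS:
            return "pitch accent"
        if candidate in PHRASE_ACCENTS:
            return "phrase accent"
        if candidate in BOUNDARY_TONES:
            return "boundary tone"
    return None


def default_break_index(label):
    """Default break index implied by an edge tone: 3 or 4, else None."""
    return {"phrase accent": 3, "boundary tone": 4}.get(classify_tone(label))
```

With this sketch, `classify_tone("L-L%")` returns `None`, reflecting that the older redundant symbol is no longer part of the inventory.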

The GToBI tonal inventory (pitch accents and boundary tones) is based on perceptually defined categories which are aligned with specific F0 events, such as peaks or valleys. This means that native speaker knowledge of the tonal system is combined with information which can be automatically derived from the speech signal.

Given GToBI's phonological nature, labelled data can be used to investigate the relation to other linguistic levels such as (morpho-)syntax, pragmatics, and semantics.

References

Essen, O. von (1956). Grundzüge der hochdeutschen Satzintonation. Ratingen: Henn.

Féry, C. (1993). German Intonational Patterns. Tübingen: Niemeyer.

Fox, A. (1984). German intonation: An outline. Oxford: Clarendon Press.

Grabe, E. (1998). Comparative Intonational Phonology: English and German. (MPI Series in Psycholinguistics 7). Wageningen: Ponsen and Looijen.

Grice, M. & R. Benzmüller (1998). Tonal affiliation in German falls and fall-rises. Poster presented at the 5th Conference on Laboratory Phonology, York.

Grice, M., M. Reyelt, R. Benzmüller, J. Mayer & A. Batliner (1996). Consistency in Transcription and Labelling of German Intonation with GToBI. Proceedings of the Fourth International Conference on Spoken Language Processing. Philadelphia. 1716-1719.

Grice, M., D.R. Ladd & A. Arvaniti (to appear). On the place of phrase accents in intonational phonology. Provisionally accepted for Phonology 17.2.

Kohler, K.J. (1977). Einführung in die Phonetik des Deutschen. (Grundlagen der Germanistik 20). Berlin: Schmidt. (Revised 2nd edition 1995).

Moulton, W.G. (1962). The Sounds of English and German. Chicago: University of Chicago Press.

Pheby, J. (1975). Intonation und Grammatik im Deutschen. Berlin: Akademie-Verlag.

Reyelt, M., M. Grice, R. Benzmüller, J. Mayer & A. Batliner (1996). Prosodische Etikettierung des Deutschen mit ToBI. In D. Gibbon (ed.) Natural Language and Speech Technology, Results of the third KONVENS conference. Berlin, New York: Mouton de Gruyter. 144-155.

Uhmann, S. (1991). Fokusphonologie. Eine Analyse deutscher Intonationskonturen im Rahmen der nicht-linearen Phonologie. Tübingen: Niemeyer.

 

 

About Speech Overlaps: Prosodic Cues contribution in predicting a Change of Speaker

R. Bertrand, R. Espesser

Laboratoire Parole et Langage, 29 avenue R. Schuman, 13621 Aix-en-Provence, France

This paper concerns a specific phenomenon inherent in oral communication: speech overlaps. Although many studies have shown the role of prosodic cues in turn-taking exchanges, they have focused on “smooth transitions”, that is, transitions without overlaps.

With the recent interest in interactive speech, this kind of study emerges naturally. From an interactive standpoint, speech overlap is in fact crucial because it implies such important notions as involvement, dominance, negotiation and cooperation. These notions in turn require us to consider speakers (who have often been excluded from the analysis) and their interactive roles in exchanges. Studying speech overlaps can illustrate a new theoretical view in which a dynamic analysis (process) is substituted for a static analysis (system).

This kind of experiment requires a separate voice channel for each speaker (avoiding overlaps on a single track). The corpus consists of 12 dialogs. In 6 dialogs, subjects were equipped with laryngophones; they were given no specific instructions (no topic of discussion) and talked only if they wished to. In the other 6, subjects were equipped with microphones; they simulated a telephone conversation and had to follow a specific interactive scenario.

At this stage of the work, we have processed the first 6 dialogs. The analysis unit (the phonatory group) was chosen for its formal character (it is a physiological event) and because it is easily detectable (the analysis is based on automatic detection of labels and patterns).

Three phases around speech overlaps were selected (P, I and C corresponding to the part of phonatory groups before, during and after overlaps).

We tested speaker change as a first variable: does the duration of the selected phases play a role in predicting a speaker change? In the statistical analysis, the partial t-tests show that durP (duration of P) and durI (duration of I) appear to be valuable predictors (t-statistics: a -1.340353; durP -3.661496; durI 3.297153). Compared with the null model, durP and durI are important as linear predictors (in the deviance table, Pr(Chi): durP = 0.000161; durI = 0.000670).

From these first results, we can formulate hypotheses about the global behaviour of speakers in interaction. For example, there is a greater chance of speaker change when P is short and I is long. In other words, a speaker change is more likely when speakers are interrupted early. Several reasons can explain this: it is easier to take the turn if the other speaker has not had sufficient time to get involved in his own speech. The interrupter can consider it legitimate to intervene, especially in the presence of a pause, which can signal a potential transition place for the two speakers. Speech overlap can also be considered to result from a longer reaction time on the part of the interrupter. To give another example, the chances of speaker change decrease when P is long and I is short. This can be explained by the specific activity of listeners in interaction. We interpret short speech overlaps as back-channel signals whose main function is to show that the listener is actively attending to the speaker (which also confirms that these signals do not interrupt the main speaker).
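The kind of analysis reported here (a deviance table against a null model, with durP and durI as linear predictors) can be illustrated with a small sketch. It assumes that the underlying model is a logistic regression, which the abstract does not state explicitly. The data points below are invented toy values, not the authors' measurements, and the names `predict`, `fit_logistic` and `deviance` are hypothetical.

```python
import math


def predict(weights, x):
    """Modelled probability of a speaker change for one (durP, durI) pair."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, rate=0.5, steps=5000):
    """Fit intercept and slopes by plain gradient ascent on the log-likelihood."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = y - predict(weights, x)
            grad[0] += error
            for i, v in enumerate(x):
                grad[i + 1] += error * v
        weights = [w + rate * g / len(rows) for w, g in zip(weights, grad)]
    return weights


def deviance(weights, rows, labels):
    """Minus twice the log-likelihood of the labels under the model."""
    total = 0.0
    for x, y in zip(rows, labels):
        p = predict(weights, x)
        total -= 2.0 * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total


# Invented toy data: (durP, durI) in seconds; 1 = speaker change, 0 = none.
rows = [(0.2, 0.9), (0.3, 0.6), (0.4, 1.0), (0.5, 0.5), (0.6, 0.8), (1.4, 0.7),
        (1.5, 0.2), (1.2, 0.6), (1.8, 0.4), (1.0, 0.3), (1.6, 0.8), (0.3, 0.3)]
labels = [1] * 6 + [0] * 6

fitted = fit_logistic(rows, labels)
# The classes are balanced, so the intercept-only (null) model predicts 0.5.
null_deviance = deviance([0.0, 0.0, 0.0], rows, labels)
residual_deviance = deviance(fitted, rows, labels)
```

The drop from `null_deviance` to `residual_deviance` is the quantity a deviance table tests; comparing `predict(fitted, (0.2, 1.0))` with `predict(fitted, (1.8, 0.1))` shows how the toy model treats a short P with a long I versus the reverse.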

At this point in the study, we are testing the role of the silent pause before P in relation to the duration of the phases (P and I). The role of melodic cues in speaker change will also be tested.

Our long-term purpose is to compare the two types of corpus and to determine whether overlapping speech functions in a specific way depending on the interactive situation.

The Effect of Accentuation on Vowel Recognition

B. Braun, J. Koreman, J. Trouvain

University of the Saarland (UdS)

This study focusses on the phonetic effects of vowel accentuation and its modelling in automatic speech recognition (ASR). Hyper- and hypospeech (H&H) theory ([1],[2]) describes the phonetic effects of deaccentuation in terms of undershoot of the articulatory target, which can show itself acoustically in shorter vowel durations, lower intensity, and greater spectral tilt ([3]). These effects lead to inhomogeneity of the vowel models and to lower recognition results. Other sources of variability in the vowel realisation, like speaking rate, position in the utterance, etc. are ignored in this study.

The aim of this paper is two-fold: (1) to show that the use of phonetic knowledge to model accentuation can improve vowel recognition, and (2) to verify traditional phonetic “knowledge”, often based on controlled experiments using read speech, by means of ASR.

Three experiments were carried out, using the prosodically labelled PhonDat database of German spontaneous speech. It provides four levels of accentuation: 0 for ‘unaccented’, 1 for ‘partially accented’, 2 for ‘accented’ and 3 for ‘reinforced’ vowels. These levels represent sentence-level prominence, not lexical stress (the two terms are sometimes used interchangeably in the literature). In the baseline experiment one hidden Markov model (HMM) was trained for each vowel, i.e. accentuation was not modelled. Second, separate HMMs were trained for the accented (levels 1, 2, and 3) and unaccented (level 0) vowels. In a third experiment, which is underway, the effects of undershoot are modelled by deriving HMMs for unaccented vowels from those for accented vowels. Similar experiments were carried out for different speaking rates ([4],[5]), where exit probabilities were increased in order to shorten sounds. As is the case for speaking rate, accentuation does not affect all vowels equally; also, different parts of the vowel are affected differently. By comparing the models for accented and unaccented vowels from experiment 2, the generalisability of traditional phonetic knowledge to spontaneous speech can be evaluated. It is expected that the duration of the middle state of long vowels, which largely models the steady state, is strongly reduced (additionally leading to spectral reduction, in line with H&H theory). The different extent to which undershoot applies to the various vowels leads to different adaptations of the models’ state transition probabilities. Within a single vowel model the exit probabilities of different states are affected differently. HMMs for unaccented vowels will be derived from accented vowel models both by phonetic rules and by data-driven methods.
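The link between exit probabilities and duration mentioned in this abstract can be made concrete with a small sketch. For an HMM state with self-loop probability a, the number of frames spent in the state is geometrically distributed with mean 1/(1 - a), so raising the exit probability 1 - a shortens the modelled sound. The self-loop values and per-state scaling factors below are invented for illustration, and `expected_duration` and `shorten_state` are hypothetical names, not the authors' rules.

```python
def expected_duration(self_loop):
    """Mean number of frames spent in a state with the given self-loop probability."""
    return 1.0 / (1.0 - self_loop)


def shorten_state(self_loop, factor):
    """Scale the exit probability (1 - self_loop) by `factor`, capped at 1."""
    exit_prob = min(1.0, (1.0 - self_loop) * factor)
    return 1.0 - exit_prob


# Assumed three-state model of an accented long vowel: onset, steady state, offset.
accented = [0.6, 0.8, 0.6]
# Assumed rule: only the middle (steady) state is shortened when unaccented.
factors = [1.0, 2.0, 1.0]
unaccented = [shorten_state(a, f) for a, f in zip(accented, factors)]

durations_accented = [expected_duration(a) for a in accented]
durations_unaccented = [expected_duration(a) for a in unaccented]
```

In this toy setting the steady state's expected duration is halved while the transitional states are left untouched, which mirrors the expectation stated above that different states of one vowel model are affected differently.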

In many approaches to ASR the variation in the signal (including that caused by accentuation) is dealt with by using multiple mixtures per state. These can neither account for variation in duration nor provide information about the degree of accentuation. Such information, however, can be used to resolve lexical ambiguities, to infer the syntactic category, or to decode information packaging.

References

[1] Lindblom, B. 1963. Spectrographic Study of Vowel Reduction. In: Journal of the Acoustical Society of America 35. pp 1773-1781.

[2] Lindblom, B. 1990. Explaining Phonetic Variation: A Sketch of the H&H Theory. In: W.J. Hardcastle and A.Marchal (eds.) Speech Production and Speech Modelling. pp 403-439.

[3] Sluijter, A.M.C., V.J. van Heuven. 1996. Spectral balance as an acoustic correlate of linguistic stress. In: Journal of the Acoustical Society of America, 100 (4). pp 2471-2485.

[4] Morgan, N., E. Fosler, N. Mirghafori. 1997. Speech Recognition using on-line Estimation of Speaking Rate. In: Proceedings of Eurospeech (1997). pp 2079-2082.

[5] Siegler, M.A., K.M. Stern. 1995. On the Effects of Speech Rate in Large Vocabulary Speech Recognition Systems. In: Proceedings of ICASSP (1995) Vol. 1. pp 612-615.

 

Security of the information system for statistical analysis of laboratory results

L. Byczkowska-Lipińska, W. Gryglewicz-Kacerka

The College of Computer Science, Lilip@ics.p.lodz.pl

Institute of Computer Science, Technical University of Łódź, Wkacerka@ics.p.lodz.pl

Haematological analyses in the medical laboratory are an important tool for adequate diagnosis and therapy. The goal of these analyses is to measure the content of certain blood components, such as haemoglobin, haematocrit, erythrocytes, leukocytes and thrombocytes. Other analyses are performed to measure chemical components such as carbohydrates, protein, cholesterol etc. The measurement methods for these components can be divided into definitive, reference and routine methods. An important element of each analysis is the evaluation of the reliability of its results.

The first international test of haemoglobin concentration measurement was organised in 1962 by the National Public Health Institute of the Netherlands. In Poland, haematological tests of the reliability of laboratory results had not been performed because of the high cost and the short shelf life of control materials. Analyses were most commonly performed with traditional manual methods and low-quality equipment, which caused high measurement errors. The situation has now changed because modern measuring equipment is both imported and manufactured.

All haematological tests are now organised by the Laboratory Diagnostic Department of the Medical Academy of Lodz. The measurement results from all laboratories participating in the tests are sent to the Medical Academy of Lodz.

A database system was designed to analyse the measurement results. The system was designed using structured analysis. The essential model of the system contains two main components: the environmental model and the behavioural model. At the end of the design stage, the model of the database system contains a full specification of the functions that must be performed to satisfy the users' requirements. The model describes the processes used by the database system. The next step may be to create the implementation model, which defines the user interface.

In an analytical laboratory, sensitive information needs to be hidden from unauthorised users. The most straightforward ways to protect data secrecy are cryptography, identification, access rights, interference control and material protection. Identification can be based on passwords as keys to the system, in which case each employee has a unique name and password. The name and the password may be given as secured text or by voice characteristics.

A database system is useful and effective when its contents can be searched quickly and automatically. This process may be made quicker by identifying speakers from their voices. In many problems of medical diagnosis and therapy, it is also necessary to evaluate the quality of a distorted speech signal.

 

Integrating different prosodic systems in speech synthesis

N. Campbell

Speech synthesis has long been a testing ground for different models of phonetic and prosodic structure. Recent developments in the technology have brought the synthesised speech much closer to that of the human original, but have become correspondingly more exacting in their tests of the underlying models of speech information. This talk will describe an extreme form of speech synthesis, which makes no use of signal processing and relies instead on a large and representative corpus of source units. It will show that the fine details of speech production can be well modelled by adequate labelling of higher-level prosodic and phonetic characteristics, and will conclude by showing that the triad of speech control must include phonation-style as well as prosody in order to capture the full range of variation in meaning carried by the voice.

 

Global Alterations to Prosody in Synchronous Speech

F. Cummins

Department of Computer Science, University College Dublin

What are the dimensions of prosodic control available to speakers? One suggestion (Lindblom, B. 1990) is that speakers vary their speech along a dimension of hypo/hyper-articulation, as evidenced, e.g. in the manifestation of prominence. The H-H dimension can be considered as one high-order control variable in production. We seek to characterize changes to speech which result from the alteration of other high-order prosodic variables, thereby gaining insight into the underlying dimensions of control.

The present study investigates speech obtained when speakers are constrained to read a familiar text in synchrony with another person. Under these circumstances, speakers make drastic alterations to their prosody. Changes to macroscopic timing and to intonation are particularly noticeable, and are shared by both speakers. Some obvious characteristics are a relatively slow speech rate and an intonation contour stripped of expressive flourishes. Despite the magnitude of the changes observed, the task of speaking in synchrony with others is both familiar (prayer, recitation, etc) and relatively simple.

We present results from an experiment with English speakers, comparing speech elicited when reading alone with speech elicited when reading in synchrony with another person (synchronous speech). Readings done together with a recording of another subject are also included as a control. We seek to characterize the principal changes to prosody under these novel experimental conditions. Measurements of speech rate, pause behavior and metricality will be presented, along with the main intonational characteristics of synchronous speech.

One interpretation of the global prosodic changes seen in synchronous speech is that speakers are making just those changes which render the resulting speech maximally predictable for their co-speaker(s). This strategy would help to explain the undeniable success observed in attempting to synchronize one’s speech with another person. If this interpretation is substantiated, the modified speech has the potential to provide information pertinent to improving the intelligibility of synthetic speech.

References:

Lindblom, B.E.F. (1990) ‘Explaining phonetic variation: a sketch of the H&H theory’ in W.J. Hardcastle, A. Marchal, eds, Speech production and speech modelling, Kluwer, Dordrecht, 403-439.

 

Multiresolution Speaker Recognition - from Short-Term to Long-Term Analysis

A. Drygajło, M. Arcienega

Signal Processing Laboratory (LTS), Swiss Federal Institute of Technology Lausanne (EPFL)

Gaussian mixture models (GMMs) and ergodic hidden Markov models (HMMs) have been successfully applied to model short-term acoustic vectors for speaker recognition systems. Prosodic features are known to carry information about the speaker’s identity and they can be combined with the short-term acoustic vectors in order to increase the performance of the speaker recognition system. However, as prosodic features vary slowly, lower sampling rates can be used to represent them.

In this paper, a new flexible statistical approach to modeling speakers, based on HMMs and multiresolution wavelet analysis coefficients, is presented. This new approach is capable of simultaneously modeling the statistical distribution of the coefficients in each subband, the inter-subband correlations and the variation in time of coefficients within a subband. It is also capable of modeling, in a unified framework, the short-term acoustic vectors and long-term prosodic features.
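To illustrate what multiresolution wavelet coefficients look like, and why slowly varying features end up represented at lower sampling rates, here is a minimal sketch using the Haar wavelet. The abstract does not say which wavelet the authors use, so Haar is an assumption chosen for simplicity; the function names and the toy contour are likewise hypothetical.

```python
import math


def haar_step(samples):
    """One Haar analysis level: (approximation, detail), each half as long."""
    scale = 1.0 / math.sqrt(2.0)
    approx = [(samples[i] + samples[i + 1]) * scale
              for i in range(0, len(samples) - 1, 2)]
    detail = [(samples[i] - samples[i + 1]) * scale
              for i in range(0, len(samples) - 1, 2)]
    return approx, detail


def multiresolution(samples, levels):
    """Return [detail_1, ..., detail_levels, approximation_levels].

    Each further level halves the sampling rate, so the last bands describe
    the slowest variation in the input.
    """
    bands = []
    current = list(samples)
    for _ in range(levels):
        current, detail = haar_step(current)
        bands.append(detail)
    bands.append(current)
    return bands


# Invented 16-frame contour (arbitrary units), for illustration only.
contour = [120, 122, 125, 130, 136, 140, 141, 139,
           135, 130, 126, 121, 118, 116, 115, 115]
bands = multiresolution(contour, levels=3)
band_lengths = [len(b) for b in bands]
```

Because the orthonormal Haar transform preserves energy, the sum of squared coefficients over all bands equals that of the input, which gives a quick sanity check on the decomposition.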

  

Correlation of perception and production of intonation contours in statements and questions in hearing impaired children

W. Gonet, A. Trochymiuk

UMCS, Lublin

The correlation between perception and production of intonation contours was studied in 25 hearing-impaired children. In the first stage of the experiment, the children heard sentences with intonation contours of varying type and F0 range, and were asked to determine whether the utterance they heard was a statement or a question. In the second stage, intonation contours of elicited statements and questions were recorded and analysed. Finally, the correlation between the results obtained in perception and production was studied and compared with the degree of hearing loss.

 

Modelling Intonational Variation in English: The IViE System

E. Grabe, B. Post, F. Nolan

Linguistics Department, University of Cambridge, Sidgwick Avenue, Cambridge CB3 9DA

British English is characterised by a considerable level of intonational variation. For instance, compared to Southern Standard British English, Northern Irish intonation patterns appear to be upside down. This variation is the topic of a research project at the University of Cambridge (Intonational Variation in English, or IViE, Economic and Social Research Council award R000237145 to E. Grabe and F. Nolan, 1997-2002).

In the project, speech samples from 8 varieties of British English are transcribed using the IViE system (Grabe et al., 1998). The IViE system is modelled on the ToBI system (Silverman et al. 1992), but unlike ToBI, which is intended solely for the transcription of standard varieties of English, the IViE system allows the user to produce directly comparable transcriptions of several 'non-standard' varieties of British English in one single transcription system.

In practice, the IViE system differs from ToBI in two respects: the tonal inventory, and the number of tiers available to the transcriber. Changes to the tonal inventory were made to allow for comparable transcriptions of more than one variety of English in a single transcription system. Unlike the original ToBI, which offers a finite set of labels which account for one particular variety of English (i.e. the so-called 'standard'), IViE offers a pool of labelling options from which transcribers can choose a subset of labels for each variety they investigate. The IViE labels themselves are based on phonological analyses of English intonation by Gussenhoven (1984) and Grabe (1998).

Secondly, the IViE system offers two new transcription tiers, the rhythmic and the pitch movement tier. The new tiers are intended to increase the transparency and replicability of the labels on the tone tier. In essence, they permit a step-by-step breakdown of the process which leads to a specific tonal transcription. In English, this process begins with the identification of rhythmically prominent (stressed) syllables because the pitch movements transcribed on the tone tier are anchored to these syllables. On the rhythmic tier, the left edge of a rhythmically prominent syllable is marked with "<" and the right edge with ">". The second step involves the identification of rhythmically prominent syllables which are associated with pitch movement (accented syllables), via inspection of the fundamental frequency trace and careful listening. The pitch movement is then transcribed on the pitch movement tier, which has heuristic rather than linguistic status. It allows labellers to make a record of the impression of a particular pitch movement which, combined with other information, leads them to assign phonological labels to a contour at a later stage. The pitch movement tier makes that decision-making process accessible to users of IViE transcriptions.

In our contribution, we will describe the structure of the IViE system, and illustrate its application with examples from British English. Additionally, we will show how our methodology can be applied to languages other than English.

 

References

Grabe, E. (1998). Comparative Intonational Phonology: English and German. Max-Planck-Institute for Psycholinguistics, Nijmegen, The Netherlands.

Grabe, E., Nolan, F., and Farrar, K. (1998). IViE - a Comparative transcription system for intonational variation in English. Proceedings of the 5th Conference on Spoken Language Processing (ICSLP), Sydney, Australia.

Gussenhoven, C. (1984). On the Grammar and Semantics of Sentence Accents. Dordrecht: Foris.

Silverman, K., Beckman, M. E., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J., and Hirschberg, J. (1992). ToBI: a standard for labeling English prosody. In Proceedings of the Second International Conference on Spoken Language Processing, 2: 867 - 70. Banff, Canada.

 

Derivational analysis by analogy for Polish in Machine Translation

F. Graliński

student at the Faculty of Mathematics and Computer Science, Adam Mickiewicz University

Analysis by analogy aims at recognising word-formations that are new to a recognising machine by means of analogy to those which have already been recognised. This type of analysis has so far been applied for Polish in inflectional analysis. An extension of the method applied for both inflectional and derivational analysis is put forward. The concept finds its implementation in Machine Translation, where the proper recognition of a Polish word-formation is verified by the possibility of determining its English equivalent(s).

 

 

Polish databases for technological applications

S. Grocholewski (1), G. Demenko (2)

(1) Technical University, Poznań, Poland; (2) Adam Mickiewicz University, Poznań, Poland

Speech databases are always designed for a specific purpose, which determines their contents and structure. Databases designed for automatic speech and/or speaker recognition must be representative of the particular language with respect to all linguistic aspects, comply with specified phonetic-acoustic criteria and adequately represent individual variability in the speech signal.

Unlike for English, speech databases for Polish were first created only in the late 1990s. The first database was CORPORA, developed at the Computer Science Institute of the University of Technology in Poznań. It comprises more than 16 thousand recordings of the speech of 45 persons. Each person uttered 365 phrases such as numerals, letters of the alphabet, control commands, names and short sentences. The dictionary was based on a requirement formulated at one of the local conferences of Polish phoneticians, who required the text to include ca. 1000 diphones of the Polish language.

As this database was developed with a view to designing applications that use automatic speech recognition, the recordings were made in conditions similar to those of the intended use, i.e. in rooms without acoustic damping, with computers running, etc. The speakers came from various regions of Poland. All recordings were semi-automatically segmented and labeled. Apart from the sound files, the CD included related text files with time information about the segment boundaries and their phonetic labels. The CD also included software for finding items that match various criteria, e.g. a phoneme sequence.

Another database was built at the Institute of Fundamental Technological Research of the Polish Academy of Sciences (PAN); it was developed by Dr R. Gubrynowicz within the BABEL project. Its content is similar to that of the CORPORA database, but the recordings were made in studio conditions. The phonetic database by Prof. W. Jassem, which comprises recordings of isolated words, was made in similar conditions.

At the Wrocław University of Technology, work is currently under way on a speech database within the SpeechDat(E) project, the counterpart of SpeechDat(M), the database for Western languages.

Specialized speech databases designed for automatic speech recognition systems for the blind, and a database for work on voice recognition systems, have recently been developed at the Computer Science Institute of the University of Technology in Poznań.

 

Modeling of the speech units for Polish

S. Grocholewski

Technical University, Poznań, Poland

This paper presents the current state of the modeling of the basic speech units for Polish. The models are evaluated on the basis of CORPORA, the first database for Polish (presented at Eurospeech'97 and ICLRE'98). The standard and the modified algorithms will be described along with the results of recent experiments.

 

Voice of orally educated deaf children 20 years later: A study of pitch contours and tonal range variations

R. Gubrynowicz, M. Sieńkowska

R. Gubrynowicz, Institute of Fundamental Technological Research, Polish Academy of Sciences, Warsaw, Poland

M. Sieńkowska, Educational Centre for Deaf, Warsaw, Poland

The overriding consequence of prelingual deafness is a dramatically lessened ability to acquire language at a rate and level that is consistent with the expectations and demands of society. The large majority of deaf children achieve linguistic abilities that severely impact their academic and vocational achievement [1]. The dominant reason for this low achievement is the dramatically reduced amount of language input and thus language competence brought on by the hearing loss at precisely the time when the brain is most primed to take advantage of such information [2].

Many studies have shown that intonation is a crucial component of the intended meaning conveyed in spoken language, and that variation in pitch range can indicate topic changes. Moreover, appropriate control of prosody is important for achieving normal-sounding speech.

In this preliminary study, we tested the production and comprehension of tonal events in several groups of hearing-impaired and deaf children with reference to a group of control subjects. The first recordings were made with a group of 35 children: 6 profoundly deaf children (about 10 years old) after 5 years of regular oral language education by a new experimental, interactive method without the use of sign language; 9 profoundly deaf children of similar age, after 5 years of language instruction by a traditional method (based on a combination of sign and oral education); 11 hearing-impaired children; and a reference group of 9 normally hearing children (10-14 years old).

The main, experimental group consisted of children from six families who wanted their deaf children to learn to talk and to acquire the communication and social competence they needed to succeed in regular education, without the use of sign language. The children in the experimental group were trained with special attention to teaching them to control prosody appropriately [3]. This depends on proper rhythm, tempo, accent, intonation and stress. Fundamental frequency control is needed to obtain tonal characteristics in accordance with the language used for oral communication.

All children recorded several lists of sentences, grouped as declarative without supplement part, declarative with supplement part, wh-questions (mainly one-word), interrogative sentences without wh-words, imperative sentences, and a sequence of one-word sentences (of two, three or four syllables) spoken in contrasting moods, i.e. indicative, interrogative and imperative. Every list was composed of 10 items. The speech material was complemented by a recording of reading aloud a short, well-known Polish children’s verse, “The lazy boy” (“Leń” by J. Brzechwa), composed of 11 sentences, 7 of which are rhetorical questions. The recording session closed with a semi-spontaneous speech item: the children were asked to tell a fairy tale from a series of pictures. No time was given for preparation, and the children were not prompted at any time.

Twenty years later, some of the people who participated in the first recordings were asked to record the same speech material again, with the exception of the picture story, which was replaced by a short monologue about the speaker’s life and his/her professional activity. The following took part in this session:

The recorded material was processed with the signal analysis program Praat [4]. This program is optimized for accurate phonetic measurements and has a user-friendly interface. With this software, precise pitch tracking and any kind of segmentation and labelling are particularly easy to achieve. A special script was written by one of the authors to automate pitch tracking and the graphic representation of the measurement results on both Hz and semitone scales.

Our study was aimed at the tonal level of the speech of deaf children trained in oral communication and of their speech at adult age. The investigations focused mainly on the group of six profoundly deaf children (<90 dB), who were recorded again 20 years later, as adults. Some reference is made to the results obtained for the other groups of speakers. The reference data are inferred from read normal speech and involve pitch contour characteristics defined by the prompted texts to be read. The first step of the analysis was therefore to compare the intonation contours and to investigate whether pitch contours are correctly employed by hearing-impaired and deaf children in the production of declarative, interrogative and imperative sentences. In particular, the set of one-word sentences shows clearly whether a speaker is able to use intonation correctly to disambiguate sentence type, because for sentences of this kind, spoken in three different moods, the pitch contour is the only cue to the sentence type distinction. The results indicate that this test was the most difficult for deaf children. The paper will present detailed results of the pitch contour analysis for all groups of speakers and discuss some specific characteristics of the vocal production of deaf talkers.

References

[1] Stone P. (1999), Revolutionizing Language Instruction in Oral Deaf Education, XIV Int. Congress of Phonetic Sciences, San Francisco.

[2] National Institutes of Health. (1993). NIH Consensus Statement: Early identification of hearing impairment in infants and young children. Washington, DC: Public Health Service.

[3] Sieńkowska M., “An attempt to note pitch melody in speech of hearing-impaired children (in Polish)”, [in:] Nowy Swiat Ciszy, 1998, 3, 30-32.

[4] Boersma P., Weenink D., Praat 3.8.28: A system for doing phonetics by computer, Web site: www.fon.hum.uva.nl/praat

Is intonation linear?

U. Gut, D. Gibbon

Fakultät für Linguistik und Literaturwissenschaft, Bielefeld

It has been claimed that formal models of F0 trajectories can be assigned to two opposing categories: ‘contour interaction models’ and ‘tone sequence models’ (Ladd 1983; Hirst & Di Cristo 1998); we will refer to these as CI and TS models respectively. In CI models, the height of successive pitch accents in an utterance is modelled in relation to an overall declination line. Evidence for this has been adduced, for instance, from Danish data (Gronnum 1992). In TS models, the F0 at a given position (e.g. an accent) is defined as a local function of the immediately preceding position; the approaches of ’t Hart et al. (1990) and of Liberman & Pierrehumbert (1984) fall into this category, and evidence from Dutch and English is given. However, the TS view has been contested for both English (Nolan, 1995; Grabe, 1998) and German (Gut, 1995). Nolan claimed to show for English that after an extra pitch height of emphasis the general declination line is not reset as would be predicted by a TS model, and thus a CI model is indicated. Gut confirmed this finding for English; in her German data, however, a resetting of the general intonation contour occurs in some cases and a TS model is apparently required. Grabe’s data on final lowering suggest that the notion of downstep (Liberman & Pierrehumbert 1984) might be better explained as declination, which would apparently support the CI model.

We look at the phonological mechanisms and phonetic functions required to model the data, and show that all models proposed to date contain both TS and CI elements, and that certain proposed cases of overlay can in fact be accounted for straightforwardly in a TS model. We explicate the notion of TS conventionally in terms of finite state automata, and explicate CI as temporal overlap and synchronised parallel patterning of a kind which can be extended to autosegmental patterns in general (Kirchoff 1994, Carson-Berndsen 1998). We operationalise the model in a prototype implementation and show that the data described above can be handled by a single homogeneous processor.

 

Recent developments in speech processing technologies for prosody

W. Hess

IKP, Bonn University, Germany

This paper deals with some acoustic aspects of prosody: measuring prosodic features, above all fundamental frequency; modifying the prosodic features of speech signals; separating segmental and prosodic features by delexicalization of a speech signal while maintaining its prosodic features, and, finally, selected applications in speech understanding and speech synthesis.

Although many algorithms for pitch determination have been developed, there is no universal algorithm that fits all purposes and all signals. To obtain an intonation contour one may rely on the robust periodicity-detection algorithms that yield frame-by-frame estimates; to manipulate signals, however, more precise (albeit less robust) algorithms are necessary that track a signal period by period. Two algorithms are discussed in more detail. One of them applies a high-order linear prediction and computes the impulse response of the obtained LP filter with the phases of all partials set to zero. Apart from its robustness this algorithm allows us to obtain the impulse response with a high sampling frequency and hence high precision of the T0 estimate. The other algorithm applies a correlation technique, again using linear prediction, to obtain estimates for the instants of glottal closure in the signal.
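As an illustration of the frame-by-frame class of periodicity detectors mentioned above, the following minimal sketch estimates F0 for a single frame from the peak of its autocorrelation function. It is not one of the two linear-prediction algorithms discussed in the paper; the function name, the search range `f0_min`/`f0_max` and the test signal are assumptions made for this example.

```python
import numpy as np

def estimate_f0_autocorr(frame, sample_rate, f0_min=75.0, f0_max=500.0):
    """Estimate F0 (Hz) of one frame from the highest autocorrelation peak."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()                      # remove DC offset
    # One-sided autocorrelation: acf[k] is the correlation at lag k samples.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f0_max)               # shortest period searched
    lag_max = min(int(sample_rate / f0_min), len(acf) - 1)
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return sample_rate / best_lag

# Synthetic 40 ms frame of a 200 Hz sinusoid sampled at 16 kHz (illustrative).
sample_rate = 16000
t = np.arange(int(0.04 * sample_rate)) / sample_rate
frame = np.sin(2 * np.pi * 200.0 * t)
f0 = estimate_f0_autocorr(frame, sample_rate)
print(round(f0, 1))
```

The estimate is quantized to whole-sample lags, which is one reason why signal manipulation calls for the more precise period-by-period methods described in the text.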

When judging the prosody of a (natural or synthetic) utterance, it is difficult for a listener to separate segmental and prosodic properties of the signal and to separate the prosody of an utterance from its contents. In speech synthesis, where prosody generation usually is a module of its own, a tool for delexicalization is desirable that permits testing an utterance's prosody apart from its other properties. A couple of such tools are presented: delexicalization by scrambling a speech synthesizer, and extraction of the prosodic properties of a speech signal by replacing each pitch period with a standard signal that retains only its amplitude and periodicity.

Being able to modify amplitude, duration, and pitch of a signal in the time domain without modifying the spectral properties has become one of the most important developments in speech synthesis; here the well-known PSOLA algorithm, developed before 1990, initiated the implementation of speech synthesis schemes operating on chunks of natural speech and greatly improved the intelligibility of such systems. Yet PSOLA has its limitations; above all the quality of the synthetic signal degrades when F0 modifications exceed the range of ½ octave. There are alternatives more tolerant to major changes, such as modifying LP residuals, or the recently developed harmonic-plus-noise model.

In speech understanding prosodic information gives valuable additive features to the speech recognition process: detecting word and sentence accents, determining the sentence mode, detecting contrastive focus, and giving hints about the syntactic and semantic structuring of an utterance are among these features. An example is given from Verbmobil, where the Erlangen group has developed a prosody module that processes many of these features.

In concatenative speech synthesis using diphones or other units of comparable size as basic elements, prosody has to be completely modeled (by rule or other algorithms, such as neural networks). Yet no model is as good as natural speech, and in most cases the highest degree of naturalness is achieved when the output signal of a speech synthesis system is not manipulated at all. In addition, recent progress in reproductive speech synthesis has made this technique a direct competitor to TTS systems for limited domains. This is a strong motivation for synthesis directly from a large corpus. Again the first implementations date back to before 1990, when the development of CHATR was started; its philosophy was to abandon signal manipulation in favor of selecting the most appropriate chains of elements from a large corpus. Units are selected by minimizing a cost function over a whole utterance, and many components of this cost function are prosodically motivated: duration, fundamental frequency, word position within phrase, sentence mode. An example from the experimental system BOSS (Bonn Open Synthesis Software) is discussed in some detail.
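Selecting units by minimizing a cost function over a whole utterance is commonly implemented as a dynamic-programming (Viterbi) search. The sketch below illustrates the idea under simplifying assumptions: each candidate unit is a hypothetical `(f0_hz, duration_ms)` pair, the target cost compares it with a prosodic target, and the join cost penalizes F0 jumps between neighbours. The cost functions, the weight 0.5 and the data are invented for the example and are not taken from CHATR or BOSS.

```python
def select_units(candidates, target_cost, join_cost):
    """Return the candidate chain with minimal total (target + join) cost."""
    # best[i][j] = (cumulative cost up to candidate j at position i, backpointer)
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, u), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(i, u), back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1], total

# Hypothetical prosodic targets and candidates: (F0 in Hz, duration in ms).
targets = [(120, 80), (130, 90), (110, 100)]
candidates = [
    [(118, 80), (140, 85)],
    [(131, 90), (100, 90)],
    [(109, 100), (150, 100)],
]

def target_cost(i, unit):
    return abs(unit[0] - targets[i][0]) + abs(unit[1] - targets[i][1])

def join_cost(left, right):
    return 0.5 * abs(left[0] - right[0])   # penalize F0 discontinuities

path, total = select_units(candidates, target_cost, join_cost)
print(path, total)
```

In a real system the candidates come from a large corpus and the cost function has many more components (for instance word position within the phrase and sentence mode, as listed above), but the search over the whole utterance has the same shape.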

 

Parsing non-continuous phrases of Polish expressions in MT

K. Jassem

Faculty of Mathematics and Computer Science, Adam Mickiewicz University

It is often suggested that parsing free-order languages, like Polish, is easier to handle with dependency-type grammars. On the other hand, PSG-type grammars have their advantages over DGs, the most relevant for Polish being the existence of two well-formalised descriptions of that kind. Here, some algorithms for parsing Polish sentences are presented which combine the determinism of dependency-like parsing with the possibility of implementing the existing PSG descriptions.

 

Predicting Prosodic Parameters:

Evolutionary Parameter Extraction and Hybrid Neural Network, Rule Based Modelling

O. Jokisch, H. Kruschke

Dresden University of Technology

F0 contour and segmental durations of synthetic speech signals are parameterized by established quantitative models. The current research deals with:

- extending the linguistic input

- adapting the model parameters for new languages or speaker characteristics

- and reducing prosodic monotonies.

Data-driven algorithms for prosody control enable the simple adjustment of prosodic parameters via training and generate variable contours for speech synthesis. Nevertheless, a strictly data-driven approach such as a neural network tends to produce local outliers and similar irregularities. This contribution introduces a hybrid neural network and rule-based approach, which combines the advantages of well-balanced, quantitative models with the possibility of training the model parameters. The following aspects are discussed:

- hybrid architecture of prosody models and extraction of model parameters

- appropriate speech databases

- training a neural network (ANN) for predicting the Fujisaki model parameters (German)

- comparing results for ANN, Fujisaki model and hybrid ANN/Fujisaki model

For standardized databases, hybrid neural network and rule-based quantitative models can be easily parameterized and adapted, e.g. for multilingual applications. Problems still to be solved are the largely automatic extraction of model parameters from speech databases and the higher complexity of such models.
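For readers unfamiliar with the Fujisaki model referred to above, the sketch below generates an F0 contour from its parameters in the standard textbook formulation: in the log-F0 domain, a base frequency Fb is summed with phrase components (impulse responses) and accent components (step responses). The parameter values used here (`alpha`, `beta`, `gamma`, the command times and amplitudes) are illustrative assumptions, not values from this contribution; in the hybrid approach such parameters would be predicted by the ANN.

```python
import numpy as np

def phrase_component(t, alpha):
    """Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0, else 0."""
    t = np.maximum(t, 0.0)
    return alpha ** 2 * t * np.exp(-alpha * t)

def accent_component(t, beta, gamma=0.9):
    """Ga(t) = min(1 - (1 + beta * t) * exp(-beta * t), gamma) for t >= 0, else 0."""
    t = np.maximum(t, 0.0)
    return np.minimum(1.0 - (1.0 + beta * t) * np.exp(-beta * t), gamma)

def fujisaki_f0(t, fb, phrase_cmds, accent_cmds, alpha=2.0, beta=20.0):
    """F0 contour in Hz; phrase_cmds = [(T0, Ap)], accent_cmds = [(T1, T2, Aa)]."""
    ln_f0 = np.full_like(t, np.log(fb), dtype=float)
    for t0, ap in phrase_cmds:
        ln_f0 += ap * phrase_component(t - t0, alpha)
    for t1, t2, aa in accent_cmds:
        ln_f0 += aa * (accent_component(t - t1, beta) - accent_component(t - t2, beta))
    return np.exp(ln_f0)

# Illustrative commands: one phrase command and two accent commands.
t = np.linspace(0.0, 2.0, 201)
f0 = fujisaki_f0(
    t,
    fb=80.0,
    phrase_cmds=[(0.2, 0.5)],
    accent_cmds=[(0.5, 0.8, 0.4), (1.2, 1.5, 0.3)],
)
print(f0.min(), f0.max())
```

Because the contour is fully determined by a handful of commands, a network only has to predict those few values per phrase or accent, which is what keeps the generated contours smooth.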

 

The Project of an Intonational Database for the Polish Language

M. Karpiński, J. Kleśta

Maciej Karpiński & Janusz Kleśta, Institute of Linguistics, Adam Mickiewicz University, Poznań, Poland

The study of prosody plays an increasingly important role in speech synthesis, automatic speech recognition and understanding, as well as in interpersonal communication analysis. However, the resources for research in these areas are still limited for East-European languages. With a view to partially filling this gap, the project of the Intonational Database for the Polish language (IDB) has been initiated.

The aim of the project is to construct an easily accessible database of Polish utterances, indexed according to their intonational, lexical, contextual and extralinguistic features. The entire database, including the signals, the traces of their intonational contours, and the controlling software, will be issued on a CD or DVD medium.

The software will enable the user to listen to the signals, to hear and view intonational contours, to read the comments and descriptions, as well as to perform index-based searches for required materials.

Five types of speech situations have been selected as sources for the recordings: (a) (semi)spontaneous conversations (e.g., about an emotion-loaded image), (b) discussions in a TV or radio program, (c) purpose-oriented conversations (e.g., booking a train ticket, asking the way), (d) reading artistic texts (prose and poetry), (e) spontaneous monologue on one of several preselected topics (e.g., favorite food, movies).

The development of IDB poses a number of theoretical and practical problems. Some of them are (1) the segmentation of the collected materials, (2) the imposition of certain formal structures on the compiled corpus, (3) building a system of indexes to provide efficient search, (4) creating an effective and user-friendly interface.

At present, the project is in the stage of collecting, analyzing and classifying speech signals. A preliminary version of the database software (based on MS Access) has been prepared for testing and demonstration purposes.

The IDB is intended to offer an open structure: it will be possible to add new examples as well as new categories of signals. The software itself allows for adding new indexes and new fields. A large corpus of high-quality recordings of Polish speech will be a by-product of the project.

 

The prosodic expression of surprise and astonishment in jokes

M. Karpiński

Institute of Linguistics, Adam Mickiewicz University, Poznań, Poland

The prosodic features of the dialogic turns in jokes are interesting for a number of reasons: (1) jokes are usually emotion-loaded texts; (2) the narration of the dialogues in jokes demands certain special skills from the speaker; (3) such dialogues may involve various characters who express their emotions in different ways; (4) the study of the suprasegmental features in such utterances may give us a notion of how people imagine the expression of certain emotions in the speech of various human or animal characters.

An experiment was carried out in order to examine the prosodic expression and the perception of surprise and astonishment in a small set of jokes told by Polish native speakers. Given the complexity of emotion-related issues in speech, only one class of emotions (i.e., surprise and astonishment) was selected for the study.

At the first stage, a corpus of over three hundred jokes was collected from a number of sources. Only a relatively small group of them met the requirements, i.e., contained short dialogues, involved surprise and astonishment, and were not based on "linguistic humor" (e.g., wordplay). The final set of four jokes contained six target utterances (turns or their parts) in which surprise or astonishment was expected to occur. Thirty subjects were asked to learn the jokes and then to tell them as naturally as possible (as if to their friends). The jokes were digitally recorded and the target phrases or turns were cut out and transferred to the computer for further analysis. Since the initial decision whether a given utterance involved surprise and astonishment was based solely on the experimenter's competence in discourse analysis, another group of twenty subjects was asked to read the jokes and judge whether the target utterances really involved these or other emotions. At the next stage, one more group of subjects was asked to listen to the recordings of the target utterances pronounced by various speakers and to judge what kind of emotions they involved. In this case, the utterances or turns were deprived of their natural, dialogic context.

It turns out that there are at least a few alternative ways of expressing the analyzed class of emotions at the suprasegmental level of speech. This can be attributed to the presence of certain more or less subtle emotions that always accompany the main one, as well as to some speaker-specific factors. Although the prosodic and contextual cues can be very explicit, there is always some room for inter-speaker and intra-speaker variation in expressing as well as in detecting emotions in speech.

 

Intonation of Hungarian questions and its prediction from text

I. Koutny, G. Olaszy

Institute of Linguistics, Adam Mickiewicz University, Poland

The intonation patterns of Hungarian questions vary greatly, depending on various features. Besides the two main categories (yes/no- and wh-questions) there are other question types and subtypes with particular intonation patterns. Sentence melody may also depend on the internal structure of the sentence and on the length of the question (one, two or more syllables). The analysis and synthesis of all these types will be discussed in the paper. The prediction of proper intonation and rhythm for TTS synthesis is performed on the basis of syntactic analysis. The melody of complex questions is built up from characteristic tone groups assigned to the sentence constituents. The experiments for finding the most relevant melody forms for every question type have been carried out with the Hungarian prosody composer and development tool. The final results can be successfully used in dialog synthesis and will be embedded into a TTS converter.

Syntax and Prosody: Case study of Hungarian

I. Koutny

Institute of Linguistics, Adam Mickiewicz University, Poland

The relation between syntax and prosody is controversial because not all information conveyed by prosody has traces in the syntax. This paper will analyze the correlation of phrase structure and prosodic structure with the place and strength of stresses inside the phrases, and with pauses at phrase boundaries, in Hungarian. We argue that sentence intonation can be composed of special melody patterns (tone groups) assigned to the phrases. A dependency-based parser was developed in order to determine automatically the main constituents of simple and long complex sentences and the relations between them, making use of the topic-focus articulation and a valency dictionary. Some problems of dialog structure will be touched upon as well.

 

F0 contour generation in TTS system for Russian language

O. Krivnova

Moscow State Lomonosov University, Philological Faculty, Department of Theoretical and Applied Linguistics

In this paper the strategy and methods of F0 contour generation in the TTS system for Russian developed at Lomonosov Moscow State University are described. The system is based on two methods: concatenation of allophones’ waveforms and prosodic rules to control fundamental frequency, duration and intensity. The prosodic rules are part of the speech control module, which serves as an interface between the output of the linguistic text-processing block and the input of the speech signal generation module. As a result, each segment (allophone) in a phrase being synthesized is assigned at least two F0 values, at its starting and ending points. Three or more F0 values can be assigned to a phone if necessary.

The basic unit for which the pitch contour is generated is the intonational phrase (IP), a coherent, grammatically organized fragment of the text to which one intonational model (abstract tune) is attributed. The type of intonational model for an IP is produced by the accent-intonation transcriptor and is fixed as an abstract prosodic marker. We distinguish 7 intonational models: 1 model of finality; 1 of non-finality; 3 interrogative models (general, special, comparative questions); 1 of exclamation (or command). For all models the possibility of a different position of the intonational centre is taken into account. The formation of F0 contours for concrete phrases within the same intonational model is carried out in a separate submodule. The calculation of F0 curves is implemented in two steps: first on a semitone scale with respect to the average pitch (reference line) of a speaker, then these values are transformed into Hz. The calculated curve lies within a working area of the speaker's voice range, the boundaries of which are typical for realizations of the chosen intonational model. The contour of the synthesized IP is formed by concatenating two types of tonal objects: tonal accents, the main one of which is nuclear, and tonal plateaus. The height and temporal alignment of tonal units are controlled by rules taking into account the intonation model itself, the rhythmical pattern of the IP and its segmental make-up. To make this possible, a preliminary coding of syllables is carried out which fixes such features as accent status, prominence level, position and sound structure. All tone rules are hand-written and based on phonetic and acoustic analysis of read-aloud texts.
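The two-step calculation described above (semitones relative to a reference line, then Hz) rests on the standard relation between semitone offsets and frequency ratios: 12 semitones correspond to a doubling of frequency. The sketch below shows this conversion; the function names and the reference value of 100 Hz are assumptions for illustration only.

```python
import math

def semitones_to_hz(semitones, reference_hz):
    """Convert an offset in semitones above the reference line into Hz."""
    return reference_hz * 2.0 ** (semitones / 12.0)

def hz_to_semitones(hz, reference_hz):
    """Convert a frequency in Hz into semitones relative to the reference line."""
    return 12.0 * math.log2(hz / reference_hz)

# Hypothetical speaker with an average pitch (reference line) of 100 Hz.
reference_hz = 100.0
contour_st = [0.0, 4.0, 7.0, 12.0, -3.0]            # targets in semitones
contour_hz = [semitones_to_hz(st, reference_hz) for st in contour_st]
print([round(f, 1) for f in contour_hz])
```

Working in semitones first lets the same rule-generated contour be reused for speakers with different average pitch; only the reference line changes.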

 

Prosodically Aided Word Sense Disambiguation in Polish-English Speech Translation

G. Krynicki

Institute of Modern Languages and Literature, Adam Mickiewicz University

This presentation reports on the influence that the prosody of selected ambiguous Polish utterances may have on their interpretation and translation into English. A method for parametric description of the pitch curves coextensive with these utterances is discussed. On the basis of the pitch parameters obtained by this method, the utterances are classified with respect to their interpretation and translation into English. Two approaches to the classification problem are presented: Dynamic Time Warping and Discriminant Analysis. The DTW classifier achieved correct classification rates of 66.7 – 79.5%, while the DA classifier's performance ranged from 82.5 to 97.5% correct classifications, depending on the disambiguated word. Perspectives for the application of prosodically aided word sense disambiguation in Spoken Language Translation are discussed.
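To make the Dynamic Time Warping approach concrete, the following minimal sketch computes a DTW distance between two pitch contours and assigns a contour to the nearest labelled template. The absolute-difference local cost, the nearest-template decision rule and the contour values are assumptions for this example; the abstract does not specify how the classifier in the study was configured.

```python
import math

def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences."""
    n, m = len(a), len(b)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])          # local distance
            # Extend the cheapest of: insertion, deletion, match.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def classify(contour, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(contour, templates[label]))

# Hypothetical pitch templates (Hz) for two readings of an ambiguous utterance.
templates = {
    "rise": [100, 110, 130, 150],
    "fall": [150, 130, 110, 100],
}
label = classify([100, 105, 125, 150, 150], templates)
print(label)
```

DTW tolerates differences in timing between contours of different lengths, which is why it is a natural baseline when utterances are spoken at different rates.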

 

Prosody control in a diphone-based speech synthesis system for Polish

S. Kula, P. Dymarski, A. Janicki, C. Jobin, Ph. Boula de Mareuil

Slawomir Kula, Inst. of Telecommunications, Warsaw University of Technology

Przemyslaw Dymarski, Inst. of Telecommunications, Warsaw University of Technology

Artur Janicki (Ph.D. student), Inst. of Telecommunications, Warsaw University of Technology

Carine Jobin, ELAN TTS

Philippe Boula de Mareuil, ELAN TTS

A text-to-speech (TTS) system for the Polish language is described, based on the concatenation of diphones using the Time Domain PSOLA technique, with appropriate prosody control. First of all, the general features of the system are addressed, then the prosody control algorithms.

The following issues will be discussed:

Text preprocessing of cardinal and ordinal numerals, fractions, real numbers, dates, time, acronyms and other abbreviations, internet-specific expressions, measurements and currency units, etc. A language independent approach to preprocessing will be presented.

Grapheme-to-phoneme conversion, based on the rules defined by M. Steffen-Batogowa (1975). The output consists of 37 phonemes, excluding silence.

Construction of test sentences by prosody transplantation from natural speech.

Construction of a database with prosody parameters (pitch, duration) and linguistic data (part-of-speech, stress and punctuation information) by analysis of natural speech and the corresponding text.

Creation of the word- and phrase-based prosody model, based on the knowledge database. The identification of the most common intonational patterns and the definition of duration rules will be described. The method used to verify the proposed prosodic model will be described, and the generated and original prosody will be compared.

Construction of the Polish diphone database, using automatically generated logatoms. The method used for the voice selection, the acoustic database creation, and the validation of the diphone database will be described.

Implementation of the proposed prosody model using the created acoustic inventory and acoustic synthesis using the Time Domain PSOLA algorithm with pitch and duration control.

Results of the listening tests of the synthesised speech will be presented, with a comparison of intrinsic prosody, prosody copied from natural speech and the (local and global) prosody model.

Brain activity during the perception of synthesized speech

S. Lattner, K. Alter, A. D. Friederici, W. Ziegler, B. Maess, U. Oertel, Y. Wang

MPI of Cognitive Neuroscience, Max-Planck-Institut fuer Neuropsychologische Forschung

Speech synthesis has seen remarkable qualitative improvements over the past decades. Nevertheless, synthesized speech is still clearly distinguishable from natural speech because of a number of acoustic and, in particular, prosodic features.

The aim of the present study was to analyse the neural response of the human perceptual system to natural and synthesized speech, focusing on the question of whether the perception of synthesized speech leads to a higher cognitive processing effort in the hearer. Neural responses to artificial and natural speech were recorded, and the brain activity was quantified and localized using magnetoencephalography (MEG).

MEG is an appropriate means of investigation because it is an on-line method that tracks neural activity in the range of milliseconds; furthermore, it allows a high spatial resolution in modelling the responses.

Method:

Three different stimuli were presented to the subjects:

1. Recording (16 bit / 16 kHz) of a German single word uttered by a male speaker = frequently presented standard stimulus.

2. Recording (16 bit / 16 kHz) of a German single word uttered by a female speaker = target 1, only rarely presented (p=0.14).

3. Synthesized stimulus on the basis of target 1, using MBROLA speech synthesis (German database d1) [1], the text2pho script for SAMPA encoding [2] and the PRAAT software [3] for adaptation of the pitch and intonation contour. This synthesized speech signal served as target stimulus 2 (T2), p=0.14.

The stimuli were presented to the subjects in randomized order (number of deviant presentations = 500). Ten subjects (normal hearing native speakers of German) participated. In order to avoid biases due to expectation or attention, they were asked to watch a soundless film and to ignore the auditory stimuli that were presented via earphones.

The specific neural activation by the target stimuli was recorded (sampling frequency: 250Hz, no. of channels: 148).

Results:

Not only the localizations but, above all, the amount of activation for the deviants was compared in order to investigate whether the processing of synthesized speech demands a higher cognitive processing effort in the hearer’s brain. The final results are still being evaluated but will be available shortly.

Finally, the paradigm presented here not only provides an objective measure of the quality of speech synthesis systems in general (in the sense of how similar the hearer’s processing effort is to that for natural speech) but also offers a method for evaluating the relative importance of various signal parameters.

References

[1] MBROLA: TCTS Lab, Faculté Polytechnique de Mons, Belgium, http://www.tcts.fpms.ac.be/synthesis

[2] text2pho: Institut für Kommunikationsforschung und Phonetik, Bonn

[3] PRAAT: P. Boersma & D. Weenink, Institute of Phonetic Sciences, University of Amsterdam, NL, http://www.fonsg3.let.uva.nl/praat/

Classification of Polish idioms for the sake of MT

M. Lisoń

Faculty of Mathematics and Computer Science, Adam Mickiewicz University

The research aims at classification of Polish idioms that occur in Information Technology texts according to the way in which they should be automatically translated into English. It is assumed that two idioms belong to the same class if they correspond to the same category of Polish grammar as formalised in the POLENG system and if they are translated into English by the same rule.

 

Telephone Speech Database for Polish

W. Majewski, J. Sadowski, P. Staroniewicz

Institute of Telecommunications and Acoustics, Wrocław University of Technology

Evaluating the usefulness and effectiveness of systems for automatic recognition of speech and speakers requires suitable tests on a representative database developed for the given language. Most of the available databases are for English. For Polish, only specialised databases exist, developed within the Corpora and Babel projects. The present speech database comprises representative utterances of 1000 Polish speakers recorded over the fixed telephone network according to the criteria set up for eastern European languages in the framework of the EC-funded project SpeechDat(E). These criteria agree with those of the databases previously produced for western European languages within the SpeechDat(M) and SpeechDat(II) projects, which permits a comparison of the R&D results obtained for a variety of European languages.

The recording platform, which supports the collection of data from two simultaneous calls, includes a connection to EURO-ISDN basic access using an AVM-A1 board and a Pentium PC with a hard disk and the Windows 98 operating system. An automatic database acquisition program (ADA) controls ISDN access. Each speech file is accompanied by a description file in SAM format that contains information on the speech signal, on the conditions and time of the recording, on the speaker and on the utterance content. The phonetic transcription is made according to SAMPA (Speech Assessment Methods Phonetic Alphabet).

The phonetic material, which fulfils several statistical conditions, contains various utterances recorded by each speaker during a computer-controlled call lasting ca. 10 minutes. The speaker responds to questions and commands given by the telephone server, and his/her utterances are automatically recorded on the hard disk. Each speaker produces 52 short utterances which, besides information on the speaker’s age and the conditions of the call, contain, among others, 12 phonetically rich sentences (taken from a set of 1536 sentences), four phonetically rich words (taken from a set of 1320 words), six application words, three spelled words, dates, time phrases, surname, city name and company name, isolated digits and sequences of digits and numbers, a currency amount, and answers to yes/no questions. Most of the utterances are read by the speaker from the prompt sheet provided, and some are spontaneous.

The 1000 speakers are a representative population of Poles, selected according to sex, age and geographical region.

The database, after validation by an independent unit (SPEX), will be distributed (on condition of payment) to interested institutions and companies by ELRA/ELDA (European Language Resources Association and Distribution Agency) and/or Siemens.

 

WinPitch 2000: a tool for experimental phonology and intonation research

Ph. Martin

University of Toronto

WinPitch is a Windows-based software program for real-time analysis and synthesis of speech. Although optimized for research on intonation, the software operates as a general-purpose speech analysis tool. Its interface and numerous functions are designed to facilitate the work of prosodists in analyzing, segmenting, labeling and manipulating speech in small or large corpora. We present here some of the new features of version 2000, which includes a new user-friendly interface. The original fundamental frequency analysis algorithm, based on the spectral comb method, has been modified and improved to include harmonic differences as well as harmonic frequencies, leading to a more reliable pitch tracking system. Furthermore, the implementation of the spectral brush method recently proposed by the author allows for the correct detection of F0 even in the presence of multiple harmonic sources in the signal (engine noise, musical instrument, second speech source…). This feature is of course very important for practical field work on speech collected from various sources (Internet, TV and radio broadcast, etc.).

We will also describe how this tool can be used effectively to describe the intonational features of French, Italian and Portuguese in ToBI or non-ToBI theoretical approaches, and in particular how prosodic morphing (modification of prosodic parameters of natural speech by PSOLA-type synthesis) can help demonstrate the pertinence of phonological features such as the pitch contours of stressed syllables and melodic contrasts. Specific labeling and segmentation tools allow for the elaboration of sophisticated prosodic databases from read or spontaneous corpora.

We will finally show how the specialized functions of the program can be used for the alignment of text and speech in large speech corpora, thanks to slow-rate speech synthesis, which allows the user to synchronize text and slowed-down speech easily for alignment purposes.

 

Synthesizing French intonation from syntactic structure

P. Mertens

CCL, Department of Linguistics, University of Leuven

The paper describes the Mingus system for intonation generation in text-to-speech synthesis for French. It is an implementation of the intonation model by Mertens 1987, 1990, 1993, 1997, 1999.

Intonation generation is preceded by morphological analysis, syntactic parsing, grapheme-to-phoneme conversion, and syllabification. The syntactic parse tree of the sentence, decorated with phonetic transcription, constitutes the input to prosody generation.

Successive processing steps will transform the parse tree into a tree consisting of prosodic units, to which tones are assigned. This process includes the following steps.

1. Syntactic constructions requiring particular intonation contours are identified, as well as sentence modality and focus, as indicated by punctuation. If the parser identifies parenthesis, this information is also used. “Prosodic markers” are added to input accordingly. These markers are initially stored as tags affecting substrings of the input: e.g. sentence modality spans the entire input string, whereas a parenthesis marker is positioned over the parenthesis only. In a later stage, when intonation groups have been formed, the prosodic markers are incorporated in the corresponding units.

2. Stress group formation is based on lexical category (word stress) and (syntactic) dependency relation. In this manner, clitic elements (such as clitic pronouns, determiners, etc.) are grouped with the non-clitic element they are governed by.

3. Stress groups may be merged into larger intonation groups in a recursive manner, descending the parse tree, thus taking into account constituency. Criteria for merging are syllable count and syntactic dependency.

4. Large constituents may be split into smaller ones, or may be reorganised under certain conditions. This once again involves dependency relations.

5. Boundary levels are associated with the resulting units.

The last unit of the sentence receives the maximum boundary level, and a decremented boundary level is passed on to the preceding units. This procedure is applied recursively until all units have received a boundary.

6. Tones are attributed to the intonation groups, on the basis of boundary levels, and prosodic markers.
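One possible reading of step 5 is sketched below. It is only an interpretation and not the Mingus implementation: the nested-list tree format, the function name, the floor of 0 and the example phrase are all assumptions. Within each constituent the last unit keeps the level it was given, and every preceding unit receives that level minus one, recursively.

```python
def assign_boundaries(node, level):
    """Return (unit, boundary_level) pairs for a tree of intonation units.

    A leaf is a string (one unit); an inner node is a list of children.
    """
    if isinstance(node, str):
        return [(node, level)]
    result = []
    lower = max(level - 1, 0)          # decremented level for earlier units
    for child in node[:-1]:
        result.extend(assign_boundaries(child, lower))
    # The last unit of a constituent keeps the level of that constituent.
    result.extend(assign_boundaries(node[-1], level))
    return result


# Hypothetical two-constituent sentence with a maximum boundary level of 3.
tree = [["le petit chat", "de la voisine"], ["dort", "sur le tapis"]]
for unit, boundary in assign_boundaries(tree, 3):
    print(boundary, unit)
```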

The resulting tonal representation consists of a list of intonation units, together with their phonetic transcription and their tonal units. This information is passed on to the contour generation and duration model. The output is a sequence of phonetic symbols with pitch targets attached to them. This information is sent to a standard speech synthesizer (MBROLA).

Pitch targets are specified for each tone as a list of (symbolic) pitch levels occurring at a relative position in the vowel carrying the tone.

 

Symbolic versus quantitative descriptions of F0 contours in German: Quantitative modeling can provide both.

H. Mixdorff

University of Applied Sciences, Berlin

In earlier studies by the authors a model of German intonation was developed which uses the quantitative Fujisaki-model for parametrizing a given F0 contour. The contour is described as a sequence of tone switches, major rises and falls, which are modeled by onsets and offsets of accent commands connected to accented syllables. Prosodic phrases correspond to the portion of the F0 contour between subsequent phrase commands. In the current study a recently developed automatic extraction algorithm for the model was applied to a ToBI-labelled radio news corpus of 48 minutes read by a single speaker. The agreement between the human labeller and the automatic procedure was examined on the accent and phrase levels. The corpus contains a total number of 13151 syllables. Of the 2498 syllables labeled as accented ('H*L','L*H', etc.) about 96% were found to be linked to accent commands, as well as 78% of the 859 syllables assigned boundary tone labels ('H%','L%'). 'Non-downstepped' accents exhibited a mean accent command amplitude of 0.28 against 0.21 for accents labeled as down-stepped.
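As background, the Fujisaki model referred to above describes ln F0 as a base value plus the responses to phrase commands (impulses) and accent commands (onset/offset steps). The sketch below follows the commonly published form of the model; the constants alpha, beta and gamma, the base frequency and the command timings are illustrative assumptions and are not taken from this study. Only the command magnitudes 1.4 and 0.28 echo the means reported in the abstract.

```python
import math


def phrase_response(t, alpha):
    """Impulse response of the phrase control mechanism."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0.0 else 0.0


def accent_response(t, beta, gamma):
    """Step response of the accent control mechanism, with ceiling gamma."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, fb, phrase_cmds, accent_cmds, alpha=2.0, beta=20.0, gamma=0.9):
    """F0 in Hz at time t (seconds).

    phrase_cmds: list of (onset_time, magnitude)
    accent_cmds: list of (onset_time T1, offset_time T2, amplitude)
    """
    log_f0 = math.log(fb)
    for t0, ap in phrase_cmds:
        log_f0 += ap * phrase_response(t - t0, alpha)
    for t1, t2, aa in accent_cmds:
        log_f0 += aa * (accent_response(t - t1, beta, gamma)
                        - accent_response(t - t2, beta, gamma))
    return math.exp(log_f0)


# Hypothetical utterance: one phrase command and one accent command.
phrase_cmds = [(0.5, 1.4)]
accent_cmds = [(1.0, 1.3, 0.28)]
for t in (0.0, 0.8, 1.2, 2.0):
    print(t, round(fujisaki_f0(t, 80.0, phrase_cmds, accent_cmds), 1))
```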

The standard accent types 'H*L', 'L*H', 'HH*L' and 'L*HL', which account for 84% of the accent labels, can be identified by the alignment of the accent command with respect to the accented syllable, expressed as T1.rel=(T1-t.on)/dur and T2.rel=(T2-t.on)/dur [T1: command onset time; T2: command offset time; t.on: syllable onset time; dur: syllable duration]. For type 'H*L', for instance, mean T1.rel and T2.rel are -42% and 85%, and for type 'L*H' 50% and 170%. A considerable number of accented syllables (N=317) were detected which had not been assigned any accent label by the human labeller. Most of these cases were lexically stressed syllables of function words, which are unaccentable by default but were incidentally accented. 97% of the phrase boundaries labeled with break index (BI) 4 and 57% of those labeled with a BI of 3 are preceded by phrase commands. The mean phrase command magnitude in these cases was found to be 1.4 and 0.64, respectively. All phrase commands could be assigned to either of the categories. In conclusion, the labeling accuracy of the automatic procedure is quite high on the accent level, and the procedure can be used successfully to determine ToBI labels without the loss of quantitative information incurred by a purely symbolic representation. The detection of lower-level phrase boundaries evidently requires the evaluation of additional features such as pausing.
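The relative alignment measures defined above are simple to compute. In the following sketch the function name and the timing values are hypothetical; the timings were chosen so that the result reproduces the mean values reported for type 'H*L'.

```python
def relative_alignment(t_cmd, syll_onset, syll_dur):
    """Position of a command onset or offset relative to the accented
    syllable: 0.0 is the syllable onset, 1.0 is the syllable end."""
    return (t_cmd - syll_onset) / syll_dur


# Hypothetical accented syllable starting at 1.00 s and lasting 0.20 s,
# with an accent command switched on at 0.916 s and off at 1.17 s.
t_on, dur = 1.00, 0.20
t1_rel = relative_alignment(0.916, t_on, dur)
t2_rel = relative_alignment(1.17, t_on, dur)
print(round(t1_rel, 2), round(t2_rel, 2))
```

A negative T1.rel means that the accent command starts before the accented syllable, and a T2.rel above 1 means that it ends after the syllable.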

 

Prosodic models and speech synthesis: towards the common ground

B. Möbius, G. Möhler, A. Schweitzer, A. Batliner, E. Nöth

IMS, Universität Stuttgart and LS Mustererkennung, Universität Erlangen-Nürnberg, Germany

Prosodic models have been extensively applied in speech synthesis. The situation in this particular branch of applied speech research is thus drastically different from the one found in automatic speech recognition and understanding. In the latter area the use of prosodic models has been rather occasional, for reasons that we have discussed in a parallel paper (Batliner et al.) in this Workshop. Obviously, there is a need for any speech synthesis system to generate prosodic properties of speech if the synthesis output is to sound even remotely like human speech. However, the necessity of synthesizing prosody has as yet not resulted in a generally agreed upon approach to prosodic modeling. This statement holds for the assignment of segmental durations as well as for generating F0 curves, the acoustic correlate of intonation contours. This paper concentrates on the use and usability of intonation models in speech synthesis. Which intonation models have been applied to synthesis? It has become customary to distinguish phonological models that represent the prosody of an utterance as a sequence of abstract units (e.g., tones) from acoustic-phonetic models that interpret F0 contours as complex patterns resulting from the superposition of several components. Besides these prevalent models at least three other approaches have been taken, viz. perception-based, functional, and acoustic stylization models. All of these approaches rely on a combination of data-driven and rule-based methods: they all systematically explore natural speech databases, but they vary in terms of what is derived from the analysis to drive intonation synthesis. For instance, acoustic stylization models represent intonation events either by continuous acoustic parameters (Taylor, 1998) or as events that are related to phonological entities such as tones or register (Möhler, 1998).

Intonation synthesis can be viewed as a two-stage process, the first aiming at representing grammatical structures and referential relations on a symbolic level and the second at rendering acoustic signals that convey the structural and intentional properties of the message. Intonation models differ drastically in terms of the interface that they provide between the higher linguistic components and the acoustic prosodic modules. At the same time, different synthesis tasks may require different interface designs. For instance, text-to-speech systems have to rely on the computation of linguistic structures from orthography, a level of representation that is particularly poor at coding prosodic information in many languages. Concept-to-speech systems, on the other hand, provide a direct link between language generation and acoustic-prosodic components. We will review the common ground between intonation models and the constraints put forth by speech synthesis.

Phonetically motivated modeling of prosody

B. Möbius

IMS, University of Stuttgart, Germany

(joint work with Jan van Santen, OGI/CSLU)

Prosody has an integrating function in the organization and production of speech by embedding semantic information, syntactic and morphological structure and the segmental chain in a consistent address frame (Levelt, 1989; Dogil, 2000). Temporal properties of the prosodic components crucially contribute to this integration. While this is self-evident for models of segmental duration, we argue that temporal control is also relevant for intonation modeling: Intonation models need to provide a precise temporal alignment of pitch (F0) contours with the segmental material.

Models of prosodic feature systems that are designed for speech technology applications such as text-to-speech synthesis rely on linguistic and phonetic expert knowledge, and frequently this knowledge is not readily available but has to be acquired by detailed phonetic studies. As a consequence, prosodic modeling also depends on the availability of large annotated text and speech corpora and on statistical methods to detect, learn and model relevant features. We will illustrate these points by discussing some of the problems encountered, and partially solved, in segmental duration modeling.

Accounting for temporal factors is one of the key motivations for the quantitative intonation model proposed by van Santen and Möbius (2000). This model computes F0 contours from a phonological representation of prosody, taking into account the effects of speech sounds, syllables and accent groups, and their respective durations. One of the starting points is the empirical observation that F0 peak location as well as the shape of accent curves depend on syllable structure and on the duration and structure of the accent group. This dependency is complex but regular and can be captured by a linear alignment model.

The alignment model determines the peak location by computing a weighted sum of the onset and rhyme durations of the stressed syllable and the duration of the remainder of the accent group. These three factors are assumed to exert different degrees of influence on peak location, quantified in terms of regression weights or "alignment parameters". Because no special status is reserved for the F0 peak, alignment parameters are also estimated for other characteristic points ("anchor points") along the accent curve.
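The weighted sum can be written down directly. The sketch below is illustrative only: the weights are invented rather than estimated values from the model, the anchor names are hypothetical, and it assumes that anchor times are measured from the start of the accent group.

```python
def anchor_time(onset_dur, rhyme_dur, remainder_dur, weights):
    """Time of an anchor point (in seconds, from the start of the accent
    group) as a weighted sum of three durations."""
    w_onset, w_rhyme, w_remainder = weights
    return w_onset * onset_dur + w_rhyme * rhyme_dur + w_remainder * remainder_dur


# Hypothetical alignment parameters, one triple per anchor point.
ALIGNMENT = {
    "rise_start": (0.30, 0.00, 0.00),
    "peak":       (1.00, 0.60, 0.10),
    "fall_end":   (1.00, 1.00, 0.70),
}

# Stressed syllable with an 80 ms onset and a 150 ms rhyme,
# followed by 300 ms of unstressed material in the same accent group.
for name, weights in ALIGNMENT.items():
    print(name, round(anchor_time(0.08, 0.15, 0.30, weights), 3))
```

Because every anchor point has its own weights, lengthening the onset, the rhyme or the remainder of the accent group stretches different parts of the accent curve by different amounts.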

This intonation model formulates several phonologically relevant hypotheses. It posits that F0 contours of accents that belong to the same perceptual or phonological class can be generated from a common template by applying the same set of alignment parameters. Accent curves can thus be interpreted as time-warped versions of a common template. It is further stipulated that accents of the same class sound the same because they are aligned in the same way with the segmental structure of the accent group that they are associated with. Conversely, two accent curves are phonologically distinct if they cannot be generated from the same template using the same set of alignment parameters. Thus, this quantitative model can play an important role in intonational phonology by facilitating the mapping of categorical phonological elements onto continuous acoustic parameters of the speech signal.

References:

Grzegorz Dogil (2000): "Understanding prosody". In Psycholinguistics - An International Handbook, ed. by G. Rickheit, T. Hermann and W. Deutsch (de Gruyter, Berlin)

Willem Levelt (1989): Speaking: From Intention to Articulation (MIT Press, Cambridge, MA)

Jan van Santen and Bernd Möbius (2000): "A quantitative model of F0 generation and alignment". In Intonation: Analysis, Modelling and Technology, ed. by A. Botinis (Kluwer, Dordrecht)

Prosodic Structuring Based on Optimized Tag-sets

A. F. Mueller (1), R. Hoffmann (2)

(1) Siemens AG, (2) Dresden University of Technology

In order to predict natural prosody for our text-to-speech (TTS) system Papageno, symbolic prosody labels are generated prior to F0-contour generation.

This paper investigates the influence of different tag-sets and different context lengths on the phrase break prediction module of our TTS system for English and German. In addition, a new data-driven approach to finding an optimized tag-set is presented.

The phrase break prediction module is fully data-driven. It uses a neural network (NN) classifier that assigns prosodic phrase labels based on error information available from auto associative neural networks [1].

 

Figure 1: The auto associative NN. The matrices w1 and w2 solve the compression and decompression task.

Figure 1 shows an auto associative NN that solves a compression-decompression task. Input data represented in a very high m-dimensional space is compressed to a representation in a much lower n-dimensional space and then decompressed again. Thus irrelevant information is suppressed. This is an important feature for the intended application, since coding part-of-speech (POS) sequences with large tag-sets leads to a high-dimensional representation of the information. If mt denotes the number of tags used, then each POS-tag is coded as an mt-dimensional vector using a ternary logic. If msl denotes the sequence length considered (left/center/right POS-context), then we get an input space of dimension m = mt * msl.

The method has been tested on a large and a small tag-set for both German and English. The large tag-set for German consists of mt = 35 tags and the small tag-set of mt = 13 tags. The small tag-set comprises groups of tags from the large tag-set that were carefully chosen and are believed to be relevant to the problem. For a POS sequence length of msl = 11, the large tag-set leads to an 11 * 35 = 385-dimensional input space. By suppressing irrelevant information it is possible to extract information even in such high-dimensional spaces. This is shown by the results presented in table 1. The results are the percentages of correctly predicted major phrase break labels for unrestricted newspaper text. As can be seen, the proposed method gains information from the larger tag-set.

mt      msl = 7     msl = 11
13      78.61%      78.00%
35      78.51%      81.23%

Table 1: Results for different tag-set sizes mt and different sequence lengths msl.
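The size of the input space follows directly from the coding of the POS context. The sketch below builds such an input vector. The abstract does not spell out the details of the ternary coding, so the scheme used here (+1 for the tag present, -1 for every other tag, 0 for positions outside the sentence) is an assumption, as are the function name and the placeholder tag names.

```python
def encode_pos_window(window, tagset):
    """Encode a window of POS tags as one flat vector of length
    len(tagset) * len(window). None marks a position outside the sentence."""
    vector = []
    for tag in window:
        for candidate in tagset:
            if tag is None:
                vector.append(0)    # padding: no information (assumed)
            elif tag == candidate:
                vector.append(1)    # this tag is present
            else:
                vector.append(-1)   # this tag is absent
    return vector


# Placeholder tag-set with mt = 35 tags and a context of msl = 11 positions.
tagset = ["T%d" % i for i in range(35)]
window = [None, None, "T3", "T7", "T0", "T12", "T7", "T1", "T34", "T2", None]
vector = encode_pos_window(window, tagset)
print(len(vector))
```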

The problem of finding an optimized tag-set is the subject of ongoing research. Promising results have been achieved with the following approach, which is based on a numerical analysis of the reconstruction error vectors e_rec of each auto associative NN (figure 1). First, the mean of each element of the error vector is computed over all input patterns. If this value reaches a certain threshold, the corresponding input is considered important. The use of the mean as the criterion for importance is motivated by the assumption that only active outputs of the upper layer in figure 1 contribute to the decision as to which label will be assigned. Each input corresponds to a certain POS-tag at a certain position of the POS-sequence. If a POS-tag is considered important at several positions within the sequence, it is kept; otherwise it is merged with another tag. In this way an optimized tag-set is determined.
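The selection criterion can be sketched as follows. This is not the authors' code: the function name, the use of the mean absolute error, the position-by-position layout of the inputs, the threshold and the minimum number of positions are all assumptions, and the error vectors are made-up numbers.

```python
def split_tagset(errors, tagset, seq_len, threshold, min_positions=2):
    """Split a tag-set into tags to keep and tags to merge.

    errors: reconstruction-error vectors, each of length
            seq_len * len(tagset), laid out position by position.
    """
    n_tags = len(tagset)
    n_inputs = seq_len * n_tags
    # Mean (absolute) error of each input over all patterns.
    means = [sum(abs(e[k]) for e in errors) / len(errors) for k in range(n_inputs)]
    # Count at how many positions each tag reaches the threshold.
    counts = {tag: 0 for tag in tagset}
    for k, mean in enumerate(means):
        if mean >= threshold:
            counts[tagset[k % n_tags]] += 1
    keep = [t for t in tagset if counts[t] >= min_positions]
    merge = [t for t in tagset if counts[t] < min_positions]
    return keep, merge


# Toy example: 3 tags, a window of 2 positions, 2 input patterns.
tags = ["N", "V", "ADJ"]
errors = [
    [0.9, 0.1, 0.0, 0.8, 0.2, 0.1],
    [0.7, 0.0, 0.1, 0.6, 0.6, 0.0],
]
keep, merge = split_tagset(errors, tags, seq_len=2, threshold=0.3)
print("keep:", keep, "merge:", merge)
```

Which tag a discarded tag is merged with is not specified in the text, so the sketch only reports the candidates for merging.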

References

[1] Achim F. Müller, Hans Georg Zimmermann, and Ralph Neuneier. Robust generation of symbolic prosody by a neural classifier based on autoassociators. In ICASSP, 2000.

 

Hungarian audiovisual prosody composer and TTS development environment

G. Olaszy, G. Kiss, G. Németh

Phonetics Laboratory of HAS and TUB Dept. of Telecommunications

Correct prosody generation is one of the most important demands in speech synthesis. Determining the rules that govern prosody generation and fine-tuning the data in those rules require special interactive tools that give access to the internal structure of prosody synthesis. An audiovisual prosody composer has been developed as part of a complete TTS development environment to make research on, and synthesis of, prosodic elements in continuous speech easier. The main feature of the program is to synthesise the given utterance (male and female voice, concatenated from diphones) and to superimpose any kind of prosody on it. The prosody of the utterance can be determined on the text level by markers and on the sound level by data. The prosody data of the synthesised utterance are shown in a prosody matrix, where the columns represent the speech sounds and the rows the prosody parameters for every sound (duration level, amplitude, F0 change, break point in the sound, the actual sound duration). Melody patterns, intensity data and the time structure are displayed in parallel and can be modified at will through the data of the prosody matrix. The system can be used as an interactive tool for prosody research, demonstration, teaching and rule definition in TTS conversion. The system runs under Windows NT.

 

Interaction between vocalic quality and fundamental frequency in the perception of Polish vowels

M. Owsianny

Polish Phonetic Association, Międzychodzka 5, 60-371 Poznań

The effects of the interaction between formant frequencies and the fundamental frequency on the perception of vowel quality were studied for Polish vowels in auditory experiments using the SMOK formant synthesiser. Earlier results showing that the identification of a vowel depends not only on formant frequencies but also on their relation to F0 were confirmed and extended. A discrepancy between the formant frequencies and the voice pitch may lead to a shift in the phonetic category. Synthetic prototypes of Polish vowels were produced and modified by changing the values of the two lowest formants. The influence of changes in the F0 parameter on the perception of the modified vowels was examined. The fields (in F1-F2 co-ordinates, depending on F0 changes) of erroneous identification of the modified prototypes of Polish vowels spoken by male and female speakers are presented. Perceptual shifts between phonetic categories as a function of voice pitch are illustrated.

 

Focus intonation in German dialects

J. Peters

Universitaet Potsdam, Germany

The study of spontaneous speech data suggests that focus intonation in German may be subject to dialectal variation. Spontaneous speech data were recorded from 20 speakers born in Hamburg and Berlin. The analysis of nuclear high accents with a terminal fall shows that both groups of speakers signal focus structure by variable timing of the accentual gesture. However, the two groups do this in different ways. Hamburg speakers were found to displace the F0 peak to the right when narrow focus is intended. Berlin speakers were found to displace the starting-point of the falling movement of F0 to the right under the same condition. Furthermore, in both groups of speakers, the segmental structure as well as the position of the accented syllable in the intonational phrase was found to interfere with accent realization.

In an autosegmental framework, both forms of variation can be characterized as two different forms of tonal reorganization. According to this analysis, the Hamburg speakers change the docking-point for the H-tone to signal narrow focus, whereas the Berlin speakers change the domain in which the H-tone affects the intonational contour. A set of rules for tone assignment is proposed which can account for most of the variation found in the two dialects. On the basis of this analysis, some other differences in the timing of nuclear high accents found in an earlier study may also be explained.

 

Probabilistic machine that learns phonemic transcription

A. Pluciński

Dep. of Phonetics, Institute of Linguistics, Adam Mickiewicz University, Poland

A simple data analysis system called a “probabilistic machine” is proposed, together with its application to learning phonemic transcription from a training sample. Analysis in this system amounts to registering histograms of the reactions to the stimuli sequences that occur in the training sample. The stimuli sequences themselves are stored in a tree structure of the kind typically used to represent data for fast searching. In the reverse direction, the system can thus find a proper reaction for a given stimuli sequence. Its most prominent properties are the ability to analyze nominal data and the ability to produce random reactions with a probability given a priori. The training sample is not required to be error-free. This makes it possible to synthesize more realistic-sounding speech with natural stumbling and variation. Deterministic behavior, namely producing the reaction with the highest probability, appears as a special case. The system can also be applied to continuous speech analysis. More generally, it can be applied to any problem that can be reduced to the schema stimuli sequence → reaction.
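The idea can be sketched as a tree of stimuli sequences whose nodes hold histograms of observed reactions. The abstract gives no implementation details, so the class and method names, the use of letter contexts as stimuli and the toy training data below are all assumptions.

```python
import random
from collections import Counter


class ProbabilisticMachine:
    def __init__(self):
        # Each tree node holds its children and a histogram of reactions.
        self.root = {"children": {}, "hist": Counter()}

    def train(self, stimuli, reaction):
        """Register one (stimuli sequence, reaction) pair from the sample."""
        node = self.root
        for s in stimuli:
            node = node["children"].setdefault(
                s, {"children": {}, "hist": Counter()})
        node["hist"][reaction] += 1

    def _histogram(self, stimuli):
        node = self.root
        for s in stimuli:
            node = node["children"].get(s)
            if node is None:
                return None
        return node["hist"] or None

    def react(self, stimuli, deterministic=True, rng=random):
        """Return a reaction for a stimuli sequence, or None if unseen."""
        hist = self._histogram(stimuli)
        if hist is None:
            return None
        if deterministic:
            return hist.most_common(1)[0][0]   # most probable reaction
        reactions = list(hist)
        # Random reaction, drawn in proportion to the observed counts.
        return rng.choices(reactions, weights=[hist[r] for r in reactions])[0]


# Toy training sample with one erroneous entry (illustrative symbols only).
machine = ProbabilisticMachine()
for _ in range(9):
    machine.train(("c", "h"), "x")
machine.train(("c", "h"), "k")          # an error in the training sample
machine.train(("s", "z"), "S")

print(machine.react(("c", "h")))
print(machine.react(("c", "h"), deterministic=False, rng=random.Random(0)))
```

In deterministic mode the single erroneous entry is simply outvoted; in random mode it is reproduced with roughly its observed frequency, which corresponds to the variation mentioned in the text.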

 

Connected speech processes as multitier/multiarticulator prosodic modulations

B. Pompino-Marschall

Center for General Linguistic Typology and Universals, Berlin

A model is proposed that interprets a variety of connected speech processes as resulting from prosodic modulations at different tiers of functional speech motor control along the hypo-hyper dimension. The general background of the model is given by the trichotomy of A-, B- and C-prosodic phenomena (Tillmann 1980) that together constitute the acoustic makeup of any speech utterance (with regard to their respective time domains at the utterance/phrase level, the syllabic level and the segmental level). One of the most general high-level tiers may be characterized by the continuous modulation of the general adjustment of the vocal apparatus to speech (e.g. adducted/tense vocal cords, raised velum, mobile tongue body and blade) or purely vegetative functioning (e.g. open/slack glottis, lowered velum, inactive tongue). This modulation of the most general setting of the vocal apparatus at the A-prosodic level seems to be responsible for the universally found phenomena of final lengthening (interpreted here as articulatory relaxation), F0-declination as well as the 'phonemically' aberrant acoustic structure of hesitation particles (being more neutral than reduced vowels, possibly nasalized). At the B-prosodic level this same modulation of 'articulatory tonus' - triggered by metrical structure - may also be responsible for reductions in unstressed syllables and function words.

An independent second high-level prosodic modulation manifests itself in the control of global speech rate (syllable rate as independent from intrinsically controlled articulatory speed). Quite a variety of connected speech processes seem to be consequences of the interaction of both proposed prosodies: articulatory relaxation would also affect the strength of interarticulatory timing and together with the enhancement of global tempo may result in changed timing relations between the gestures at different articulator-defined tiers (velar, laryngeal, labial, mandibular, coronal, dorsal) that show their intrinsically specified time constants. The well-known gestural overlap phenomena - resulting in segmental 'quasi assimilations' and 'elisions' - as well as the instability in timing of e.g. laryngeal reflexes of a glottal stop before word-initial vowels or of internasal plosives at the segmental level (Kohler 1999/2000) and of syllabicity (Pompino-Marschall 1999) could be interpreted as an interaction of the proposed multitier/multiarticulator prosodies. The different processes will be demonstrated by examples of German spontaneous speech.

 

Polish version of TAPS (Test of Auditory Perception of Speech for Children)

A. Pruszewicz, G. Demenko, T. Wika, L. Rychter, W. Szyfter, B. Woźnica, A. Sekula

Department of Phoniatrics and Audiology, Department of Otolaryngology, Karol Marcinkowski University School of Medical Sciences in Poznań, Poland; Institute of Linguistics, Adam Mickiewicz University, Poznań, Poland

Special speech perception tests for children with cochlear implants (CI) were developed on the basis of the TAPS program from the Cochlear Center in Basel, Switzerland. Compared with the English version, there is no change in the first category (detection of speech sounds) or in the second (perception of speech patterns).

The fundamental changes in the Polish version are in the third and fourth categories. The third category (speech identification), prepared for the perception of rhythmical speech patterns in words with various numbers of syllables and for word identification, consists of two versions. The first version contains a number of words classified according to a semantic criterion. The second version contains words classified by structural criteria: words from one set have the same vowel (v) and consonant (c) structure.

Various word structures were taken into consideration to test the perception of the phonetic and acoustic structure of the Polish language:

1) c v c, where c represents low and high consonants

2) c v c c or c c v c, where each word has a stop consonant, as well as low and high ones

The fourth category, dealing with sentence identification, comprises the evaluation of suprasegmental information together with segmental information in simple sentences. All the sentences have the same structure and all the words consist of two syllables. Open-set speech recognition, the fifth category, contains tests typical for the Polish language, concerning a defined topic presented in a picture.

 

The acoustic analysis of voice in children with multichannel cochlear implant

A. Pruszewicz, B. Woźnica, W. Szyfter, A. Sekula, E. Szymiec, P. Świdziński, M. Karlik

Department of Phoniatrics and Audiology and Department of Otolaryngology, Karol Marcinkowski University School of Medical Sciences, Poznań, Poland

The material of our investigation was a group of 6 children aged 4 to 12 years implanted with the Nucleus Mini System 22, divided into 2 categories depending on the onset of deafness: prelingual and postlingual. The analysis was made before implantation and during rehabilitation after implantation. The examination was carried out using the KAY Electronics 4300 instrument and MDVP/CSL programmes. The acoustic analyses, comprising spectrographic and tonographic examinations, were performed on appropriate linguistic material (vowels, isolated words, sentences and a reading text) constructed on the basis of phonetic rules. Fo was measured in isolated vowels before and after implantation, together with other parameters: mean jitter and shimmer, noise-to-harmonic ratio (NHR), number of unvoiced segments (NUV), soft phonation index (SPI), peak amplitude variation (vAm) and fundamental frequency variation (vFo). Acoustic analyses of voice in implanted children, particularly those with prelingual deafness, can serve as an objective evaluation of progress in speech and hearing rehabilitation. On the basis of the selected acoustic parameters of speech we can evaluate not only the periodic variability of the Fo laryngeal tone parameters but also, through spectrographic analysis, the harmonic structure of speech sounds and intonation changes in implanted patients.
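
As a reminder of what the perturbation measures listed above quantify, the following minimal sketch computes a relative jitter and shimmer value from a sequence of pitch periods and peak amplitudes. It uses one common textbook definition (mean absolute difference between consecutive cycles, as a percentage of the mean); the MDVP software used in the study may compute its parameters differently, and the function name and sample values are illustrative assumptions only.

```python
def relative_perturbation(values):
    """Mean absolute cycle-to-cycle difference as a percentage of the mean.

    Applied to pitch periods this gives a relative jitter value; applied to
    peak amplitudes it gives a relative shimmer value. Hypothetical helper,
    not the MDVP algorithm.
    """
    if len(values) < 2:
        raise ValueError("at least two cycles are needed")
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return 100.0 * mean_diff / mean_value


# Invented example data: pitch periods in ms and peak amplitudes
# (arbitrary units) for five consecutive glottal cycles of a vowel.
periods_ms = [4.00, 4.04, 3.98, 4.02, 4.00]
peak_amplitudes = [0.80, 0.78, 0.82, 0.79, 0.81]

jitter_percent = relative_perturbation(periods_ms)
shimmer_percent = relative_perturbation(peak_amplitudes)
```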

 

Intonational procedures (Intonatorische Verfahren)

S. Rabanus

Forschungsinstitut für deutsche Sprache, Marburg

Intonation is an autonomous signalling system from which speakers choose cues, used in co-occurrence with cues from other autonomous signalling systems, for the constitution and contextualization of activity types in conversation. Participants in conversations use given intonation contours as a resource which they modify according to their target activities. In so doing, speakers of German and Italian use three intonational procedures: weakening of intonation contours is used to signal self-initiated self-repair, treatment of problems of hearing and understanding, and claim to turn; enforcement signals problems of expectation, defence of turn, contradiction, and insistence; modification of intonation contours signals defence of turn and contradiction. The assignment of intonational procedures to activity types appears to be the same in both Italian and German.

 

The psychological scale of the pitch measured by absolute magnitude estimation

A. Rakowski, A. Miśkiewicz

Chopin Academy of Music, Warsaw

The psychological pitch scale was measured by absolute magnitude estimation. A group of subjects assigned numbers to the pitch of pure tones. The tones were presented at a constant loudness level of 60 phons and their frequencies spanned a range of 31.5-12500 Hz in 1/3 octave steps. Results show that the function describing the relation of pitch to the tone's frequency is linear when plotted in logarithmic coordinates. The function is composed of two segments: the segment corresponding to frequencies higher than 250 Hz is about two times steeper than that obtained in the low-frequency range. The pitch scale determined in the present experiment is discussed in relation to other measures of pitch: the mel scale, the musical semitone scale, and the psychological pitch scales based on the measures of auditory filter bandwidth.
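
The shape of the reported function can be sketched as follows. A function that is linear in log-log coordinates is a power function, so the two segments correspond to two exponents, the upper one twice the lower one, joined at 250 Hz. The abstract gives no numerical slopes, so `low_slope` and `ref` below are hypothetical placeholders, not results of the experiment.

```python
import math

BREAK_HZ = 250.0  # breakpoint between the two segments, as stated in the abstract


def pitch_estimate(freq_hz, low_slope=0.5, ref=1.0):
    """Two-segment power function, continuous at BREAK_HZ.

    low_slope: log-log slope below the breakpoint (hypothetical value).
    ref:       pitch estimate assigned to BREAK_HZ (hypothetical value).
    Above the breakpoint the slope is twice as steep.
    """
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    ratio = freq_hz / BREAK_HZ
    if freq_hz <= BREAK_HZ:
        return ref * ratio ** low_slope
    return ref * ratio ** (2.0 * low_slope)


def loglog_slope(f1, f2):
    """Slope of the function between two frequencies in log-log coordinates."""
    rise = math.log(pitch_estimate(f2)) - math.log(pitch_estimate(f1))
    return rise / (math.log(f2) - math.log(f1))
```

With these placeholder values, `loglog_slope(62.5, 125.0)` is half of `loglog_slope(500.0, 1000.0)`, which is the only property the sketch is meant to show.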

 

Prosodic-Syntactic Module for Continuous Speech Recognition Systems

A. Rozenknop, A. Drygajło

Artificial Intelligence Laboratory, Signal Processing Laboratory (LTS), Swiss Federal Institute of Technology Lausanne (EPFL)

Many papers have reported investigations concerning prosodic features in speech synthesis but only a few in speech recognition. The goal of this paper is to integrate a prosodic feature processing module in a speech recognition system to reduce ambiguities among different syntactical interpretations of a signal.

The complete process is as follows. First, the signal to be recognized is processed by a speech recognizer module based on hidden Markov models (HMMs) and a bigram language model. The output of this module is a lattice of word hypotheses, which represents a large collection of sentences in a compact way. Each word hypothesis comes with the time alignment of its phonemes with the signal. In parallel, prosodic features (pitch and energy) are computed from the original signal and represented in each acoustic vector. Then the word lattice and the prosodic features are transferred to a prosodic-syntactic module. Its task is to find the best interpretation of the signal by calculating prosodic and syntactic scores and merging them. As a result, we can obtain the sentence with the highest score or a list of the N best sentences.

The prosody of the sentence is characterized by three prosodic parameters: energy, pitch and the time alignments of phonemes. The pitch, or fundamental frequency, is interpolated and smoothed on unvoiced segments, then filtered by three different band-pass filters. The time derivatives of the band-pass signals are calculated and added to the acoustic vector, which leads to eight pitch features for each time frame. Eight energy features are calculated similarly to the pitch features. Time alignments are used to compute the length of syllabic kernels and the distances to their neighbours.

The syntactic module is based on a CYK-like, bottom-up parser and a Stochastic Context-Free Grammar. The parser, originally intended only for parsing sentences, has been extended to cope with lattices of word hypotheses. It is able to search the lattice and find the sentence with the highest score. The score represents the probability of a sentence under the stochastic grammar.
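
Two of the pitch-processing steps mentioned above, bridging unvoiced segments and taking a time derivative, can be sketched as follows. This is a simplified illustration under stated assumptions: unvoiced frames are assumed to be marked with 0, the interpolation is linear, the derivative is a plain frame-to-frame difference, and the smoothing and band-pass filtering stages are omitted. The function names are hypothetical and not taken from the system described.

```python
def interpolate_f0(f0, unvoiced=0.0):
    """Fill unvoiced frames by linear interpolation between voiced neighbours.

    Frames before the first or after the last voiced frame are filled with
    the nearest voiced value.
    """
    voiced = [i for i, value in enumerate(f0) if value != unvoiced]
    if not voiced:
        raise ValueError("no voiced frames to interpolate from")
    out = list(f0)
    for i in range(voiced[0]):                   # leading unvoiced frames
        out[i] = f0[voiced[0]]
    for i in range(voiced[-1] + 1, len(f0)):     # trailing unvoiced frames
        out[i] = f0[voiced[-1]]
    for left, right in zip(voiced, voiced[1:]):  # gaps between voiced frames
        step = (f0[right] - f0[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = f0[left] + step * (i - left)
    return out


def delta(track):
    """Frame-to-frame difference; the first frame gets 0."""
    return [0.0] + [b - a for a, b in zip(track, track[1:])]


# Invented F0 track in Hz, one value per frame, 0 = unvoiced.
raw_f0 = [0, 100, 0, 0, 130, 0]
smooth_f0 = interpolate_f0(raw_f0)
f0_delta = delta(smooth_f0)
```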

The integration of the prosodic features in the syntactic module is a two-step process. In the training step, a collection of syntactic trees is analysed, and a set of prosodic features is associated with each syntactic group occurring in the trees. Gaussian mixture models (GMMs) of prosodic features are then computed for each syntactic group, representing the distribution of probabilities of co-occurrence between the syntactic groups and the prosodic features. In the recognition step, trees are built bottom-up, by choosing a higher syntactic group according to lower ones. The likelihood of prosodic features associated with each node of the tree is merged with the syntactic probability of the subtree rooted at that node, and the final score leads to the selection of the best groups.
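
One way to picture the merging step is sketched below. The abstract does not say how the two scores are combined, so the log-domain sum with a weighting factor is an assumption, as are the one-dimensional feature, the mixture parameters and all names. The sketch evaluates a prosodic feature under a small GMM for each candidate syntactic group and adds the weighted log-likelihood to the syntactic log-probability of the subtree.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log-density of a scalar feature x under a 1-D Gaussian mixture."""
    component_logs = [
        math.log(w) - 0.5 * math.log(2.0 * math.pi * v) - (x - m) ** 2 / (2.0 * v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(component_logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(c - top) for c in component_logs))


def node_score(syntactic_log_prob, prosodic_log_lik, prosody_weight=1.0):
    """Merged score of a tree node (assumed log-linear combination)."""
    return syntactic_log_prob + prosody_weight * prosodic_log_lik


# Invented example: two candidate syntactic groups for the same span,
# each with its own GMM over one prosodic feature (e.g. a pitch slope).
feature = 0.2
candidates = {
    "NP": {"syn": math.log(0.4), "gmm": ([0.5, 0.5], [0.0, 1.0], [1.0, 1.0])},
    "VP": {"syn": math.log(0.6), "gmm": ([1.0], [3.0], [0.5])},
}
scores = {
    label: node_score(c["syn"], gmm_log_likelihood(feature, *c["gmm"]))
    for label, c in candidates.items()
}
best_group = max(scores, key=scores.get)
```

In this toy case the group with the lower syntactic probability wins because the prosodic feature is far more likely under its mixture, which is the kind of disambiguation the module is meant to provide.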

This work is inspired by the INTARC system for the extraction of prosodic features, and by P. Langlais' PhD thesis, for the association between the prosody and syntax.

 

Emotional and linguistic prosody in right hemisphere damaged patients

K. Rymarczyk (1), E. Łojek (2)

(1) Nencki Institute of Experimental Biology, Polish Academy of Sciences; Department of Neurophysiology, e-mail: kr@nencki.gov.pl

(2) Faculty of Psychology, University of Warsaw, Warsaw, Poland

The main purpose of this presentation is to show the connections between prosody and the human brain. It is now very well established that the left hemisphere is involved in linguistic processes at the levels of phonetics, syntax and semantics. Studies conducted over the last two decades have indicated a strong contribution of the right hemisphere to various discourse functions, including prosody. Many authors have suggested that right hemisphere damage (RHD) can lead to two types of aprosody: emotional (affective) and linguistic. There are, however, data showing that the linguistic aspects of prosody can also be impaired in left hemisphere damaged (LHD) patients. In order to study aprosody in Polish brain-damaged populations we have designed an experimental test battery. All measures were developed taking into account the results of previous neuropsychological investigations in this area as well as the Polish language and culture. The battery consisted of the following tests:

Affective prosody comprehension tests: a) Estimation of adequacy of affective prosody with reference to semantic content b) Discrimination of sentences containing either the same or different emotional intonation c) Naming of emotions expressed by intonation.

Affective prosody expression test: repetition of sentences (affective intonation congruent with semantic content).

Linguistic prosody comprehension tests: a) Discrimination of lexical stress (Literature, vs. literature) b) Discrimination of emphatic stress (winter starts in December, vs. winter starts in December) c) Discrimination of linguistic intonation (e.g. declarative vs. interrogative sentences) d) Naming of linguistic intonation (declarative vs. interrogative sentences).

Linguistic prosody expression tests: a) Repetition of sentences with emphatic stress (Winter starts in December) b) Repetition of sentences with linguistic intonation (declarative, interrogative sentences).

Thirty-seven RHD subjects, 10 LHD aphasics and 51 healthy controls participated in the experiment. Results indicated significant differences between the brain-damaged (BD) and control groups in all experimental tasks. However, the RHD patients had the lowest scores on all measures. The LHD subjects scored lower than the RHD group on the linguistic prosody tasks, whereas RHD subjects got lower results on emotional prosody tasks. However, these tendencies were not significant. The clinical value of the applied methods for the assessment of emotional and linguistic aprosody will be discussed.

 

The comparison of basic speech units in automatic speech recognition

P. Staroniewicz

Institute of Telecommunication and Acoustics

The choice of elementary speech units is among the most significant problems for the efficiency of automatic speech recognition (ASR) systems. Usually, if the system is based on the recognition and labelling of elementary speech units, it meets problems of ambiguous segmentation and coarticulation effects. The candidate phonological units for ASR include allophones, phonemes, syllables, words, diphones, triphones and some combinations of these. When phonemes are used as basic units, their identification is difficult because of large differences within each class. The main advantage of using the phoneme as a basic unit is that the set of phonemes for any language represents the smallest number of distinctive phonological units, which is substantially smaller than the set of allophones, diphones, syllables or words. The disadvantages of using the phoneme as the phonological unit in recognition systems are the difficulties of acoustically identifying the phonemes and their boundaries. Selecting phonemes requires a substantial number of acoustic-phonetic and phonological rules at lower and higher levels within the recognition system, since none of the coarticulation and junctural phenomena of speech are represented in the phoneme. The term allophone is used to represent a set of phonemes within a given language which have the same information-bearing parameters or distinctive features, either physiologically or acoustically. Syllables are phonological units which can generally be defined as a vowel nucleus and its functionally related neighbouring consonants, though the specification of the syllable can become a problem for utterances that have two or more nuclei with surrounding consonants. It is well known that a great deal of the acoustic information used to identify the consonants of a given language lies in the consonant-vowel or consonant-consonant transitions. The specific unit which includes this transitional information is called the diphone. Recently, the first experiments with diphones have been carried out. In comparison with other units, the disadvantages of selecting the diphone are that the inventory may be relatively large and that most phonological rules are not easily applied to diphones.

The main advantages of using the diphone as an elementary unit are that it carries some of the coarticulation information within itself, since this information lies in the transitions between sounds, and that it has features that make it well suited to HMM-based systems.
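
The relation between phoneme and diphone units discussed above can be made concrete with a short sketch. It splits a phoneme sequence into overlapping pairs, each covering one transition between neighbouring sounds. The function name, the boundary symbol `#` and the example transcription are illustrative assumptions, not part of the work reported.

```python
def to_diphones(phonemes, boundary="#"):
    """Turn a phoneme sequence into diphones (pairs of adjacent sounds).

    The sequence is padded with a boundary symbol so that the transitions
    from and into silence are represented as well.
    """
    padded = [boundary, *phonemes, boundary]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# Invented example: a three-phoneme word.
word = ["k", "a", "t"]
units = to_diphones(word)
```

A word of three phonemes thus yields four diphones. Since a diphone is an ordered pair, a language with n phonemes has at most n * n phoneme-to-phoneme diphones, which is why the inventory may be relatively large compared with the phoneme set.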

 

Speech synthesizer's commercial success depending on recognition of the users' needs

J. Urbański

Harpo, Poland

The report presents the relationship between close cooperation with a production unit during the R&D phase and the commercial success of the developed device. Examples of aids for the blind indicate the necessity of interaction between potential users and developers in the early stages of their work. Research on sales of speech synthesizers for the blind shows that in some cases the researchers failed to meet users' preferences even on the most obvious issues.

 

Adding Morphological Knowledge to Neural Network Models of Finnish Prosody

M. Vainio

Department of Phonetics, University of Helsinki

The basic assumption in intonation models, and perhaps in prosody models generally, is that part-of-speech information is of paramount importance for predicting the actual values of the prosodic parameters, be they pitch, segmental duration or loudness.

We have studied whether morphological knowledge, in addition to part-of-speech information, is of any help in predicting prosody in a morphologically rich language such as Finnish. Our research concerns Finnish prosody with respect to pitch, loudness and segmental duration.

The basic methodology we employ is based on artificial neural networks. It is a continuation of our earlier studies on prosody where we investigated the problem of generating values for prosodic parameters from symboli