Institut Polytechnique de Hanoi

This thesis manuscript is confidential

THESIS
submitted for the degree of
DOCTOR OF THE COMMUNAUTÉ UNIVERSITÉ GRENOBLE ALPES
prepared under a joint supervision (cotutelle) agreement between the Communauté Université Grenoble Alpes and the Institut Polytechnique de Hanoi

Specialty: EEATS
Ministerial decree: 6 January 2005 - 7 August 2006

Presented by Thi-Anh-Xuan TRAN

Thesis supervised by Eric CASTELLI, co-supervised by Thi-Ngoc-Yen PHAM, with Nathalie VALLÉE as co-advisor

prepared at the International Research Institute MICA (Multimédia, Information, Communication et Applications), Hanoi, Vietnam, and at GIPSA-Lab (Grenoble Images Parole Signal Automatique), Grenoble, France, within the doctoral school Électronique Électrotechnique Automatique & Traitement du Signal

ACOUSTIC GESTURE MODELING. APPLICATION TO A VIETNAMESE SPEECH RECOGNITION SYSTEM

Thesis publicly defended on 30 March 2016 before a jury composed of:
Ms. Martine ADDA-DECKER, Research Director, CNRS, Laboratoire de Phonétique et Phonologie, Paris, President
Mr. Georges LINARÈS, Professor, Université d’Avignon et des Pays de Vaucluse, Avignon, Reviewer
Mr. François PELLEGRINO, Research Director, CNRS, Dynamique du Langage, Lyon, Reviewer
Mr. Eric CASTELLI, Professor & Research Scientist, CNRS, MICA, Hanoi, Thesis supervisor
Ms. PHAM Thi Ngoc Yen, Professor, Institut Polytechnique de Hanoi, Thesis co-supervisor
Ms. Nathalie VALLÉE, Research Scientist, CNRS, GIPSA-lab, Grenoble, Co-advisor

Acknowledgments

Foremost, I would like to express my most sincere and deepest gratitude to my thesis supervisors, Prof. Eric CASTELLI, Prof. PHAM Thi Ngoc Yen (MICA_CNRS, Vietnam) and Dr. Nathalie VALLÉE (GIPSA-lab, Grenoble), for their continuous support and guidance during my PhD program, and for providing me with such a serious and inspiring research environment. Many thanks to Prof. Eric CASTELLI, who guided me throughout the years of my thesis, for helping to shape it at the beginning, and for his support and his advice on my research and writing. I am also very thankful to Dr. Nathalie VALLÉE for her advice and encouragement throughout my thesis.

I am fortunate to have had the opportunity to work with Prof. René CARRÉ (DR émérite, CNRS). He taught me essential knowledge in areas such as speech production and speech perception. I am very grateful to Prof. René CARRÉ for his active involvement in guiding part of my research.

I highly appreciate the opportunity to know and work with Mr. Alexis MICHAUD (MICA_CNRS, Vietnam). I am sincerely indebted to Alexis for his comments on linguistics and writing.

I would like to thank Mr. Jean-Marc THIRIET, director of GIPSA-lab, for accepting me into the Speech and Cognition department. Many thanks to Prof. PHAM Thi Ngoc Yen (former director of the MICA institute) and Mr. NGUYEN Viet Son, director of the MICA institute, for allowing me to work in the SpeechCom department.

I take this opportunity to extend my heartfelt gratitude to all members of MICA (especially the members of the SpeechCom department) and all members of GIPSA-lab (especially the members of the Speech and Cognition department), who welcomed me and gave me many useful comments and discussions concerning my work.

Last but not least, I would like to dedicate this moment to my parents and my husband, whose endless love and support throughout my thesis gave me the courage to complete it.

ABSTRACT

Speech plays a vital role in human communication. Selection of relevant acoustic speech features is key in the design of any system using speech processing. For some 40 years, speech has typically been considered as a sequence of quasi-stable portions of signal (vowels) separated by transitions (consonants).
Despite a wealth of studies that clearly document the importance of coarticulation, and reveal that articulatory and acoustic targets are not context-independent, the view that each vowel has an acoustic target that can be specified in a context-independent manner remains widespread. This point of view entails strong limitations. It is well known that formant frequencies are acoustic characteristics that bear a clear relationship with speech production, and that can distinguish among vowels. Therefore, vowels are generally described with static articulatory configurations represented by targets in the acoustic space, typically by formant frequencies in the F1-F2 and F2-F3 planes. Plosive consonants can be described in terms of places of articulation, represented by locus or locus equations in an acoustic plane. But formant frequency trajectories in fluent speech rarely display a steady state for each vowel. They vary with speaker, consonantal environment (coarticulation) and speaking rate (relating to the continuum between hypo- and hyper-articulation).

In view of the inherent limitations of static approaches, the approach adopted here consists in studying both vowels and consonants from a dynamic point of view. First, we studied the effects of the impulse response at the beginning, at the end and during transitions of the signal, both in the speech signal and at the perception level. Variations of the phases of the components were then examined. Results show that the effects of these parameters can be observed in spectrograms. Crucially, the amplitudes of the spectral components distinguished under the approach advocated here are sufficient for perceptual discrimination. On the basis of this result, all subsequent speech analysis focuses on the amplitude domain only, deliberately leaving aside phase information. Next, we extended the work to vowel-consonant-vowel perception from a dynamic point of view.
These perceptual results, together with those obtained earlier by Carré (2009a), show that vowel-to-vowel and vowel-consonant-vowel stimuli can be characterized and separated by the direction and rate of the transitions in the formant plane, even when absolute frequency values are outside the vowel triangle (i.e. the vowel acoustic space in absolute values).

Because of the limitations of formant measurements, the dynamic approach requires new tools, based on parameters that can replace formant frequency estimation. Spectral Subband Centroid Frequency (SSCF) features were studied. Comparison with vowel formant frequencies shows that SSCFs can replace formant frequencies and act as “pseudo-formants” even during consonant production. On this basis, SSCF is used as a tool to compute dynamic characteristics. We propose a new way to model dynamic speech features, which we call SSCF Angles. Our analysis of SSCF Angles was performed on transitions of vowel-to-vowel (V1V2) sequences in both Vietnamese and French. SSCF Angles appear to be reliable and robust parameters. For each language, the analysis results show that: (i) SSCF Angles can distinguish V1V2 transitions; (ii) V1V2 and V2V1 have symmetrical properties in the acoustic domain based on SSCF Angles; (iii) SSCF Angles for male and female speakers are fairly similar for the same V1V2 transition; and (iv) they are also fairly invariant to speech rate (normal and fast).

Finally, these dynamic acoustic speech features are used in a Vietnamese automatic speech recognition system, with several interesting results.

Key words: vowel gesture, dynamic acoustic features, magnitude of speech, transition direction and rate, SSCF Angles, automatic speech recognition.
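As an illustration of the SSCF features named in the abstract, the following is a minimal sketch and not the implementation used in this thesis. It assumes the commonly used definition of a spectral subband centroid, namely the power-weighted mean frequency within each subband, with rectangular subbands. The function name `sscf`, the `band_edges_hz` layout and the `gamma` exponent are hypothetical choices made for the example; the subband design actually adopted is the subject of Chapter 4.

```python
import numpy as np

def sscf(frame, sample_rate, band_edges_hz, gamma=1.0):
    """Return one spectral centroid (in Hz) per subband for a single frame.

    Assumption: rectangular subbands delimited by consecutive values of
    band_edges_hz; gamma compresses or expands the power spectrum.
    """
    # Window the frame to limit spectral leakage, then take the power spectrum.
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    centroids = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        in_band = (freqs >= lo) & (freqs < hi)
        weights = power[in_band] ** gamma
        total = weights.sum()
        if total > 0:
            # Power-weighted mean frequency of the subband.
            centroids.append(float((freqs[in_band] * weights).sum() / total))
        else:
            # Empty or silent subband: fall back to the band centre.
            centroids.append(0.5 * (lo + hi))
    return np.array(centroids)
```

Applied to a 512-sample frame of a 1 kHz tone sampled at 16 kHz, with band edges at 0, 500, 1500 and 4000 Hz, the centroid of the middle subband falls close to 1 kHz. Unlike a formant tracker, this computation returns a value for every frame, whether or not the frame contains a clear resonance.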
Contents

List of figures
List of tables
Abbreviations
Introduction

Part I. State of the art

Chapter 1 State-of-the-art on speech feature
1.1 Speech production
1.2 State of the art on static speech
1.3 The paradox of static speech approach
1.4 State of the art on dynamic speech
1.4.1 Production dynamics of speech
1.4.1.1 Reviewing dynamic characteristic of French vowel-to-vowel trajectories
1.4.1.1.1 [aV] characteristics in the F1-F2 plane
1.4.1.1.2 [aV] transition rate
1.4.1.2 Reviewing dynamic characteristic of Vietnamese speech production
1.4.1.2.1 Vietnamese database
1.4.1.2.2 The dynamic characteristic on Vietnamese vowel production
1.4.1.2.3 The dynamic characteristic on Vietnamese final consonant production /p, t, k/
1.4.2 Perceptual dynamics of speech
1.4.2.1 Review on Vowel-to-Vowel perception
1.4.2.1.1 Methodology
1.4.2.1.2 Results in perception and conclusions
1.4.2.2 Other previous studies on perceptual dynamics of speech
1.4.3 Dynamics in speech applications
1.5 A first study on acoustic Vietnamese vowel gesture based on formant
1.5.1 Methodology
1.5.2 Stimuli
1.5.3 Results
1.5.4 Limitations
1.6 Conclusions of chapter 1

Part II. Contributions

Chapter 2 A study of speech signal in terms of amplitude and phase
2.1 Introduction
2.2 Characteristics of impulse response and magnitude of the spectral components in Vietnamese speech
2.2.1 Experiment 1 - Impulse responses are produced in natural speech
2.2.1.1 Methodology
2.2.1.2 Observation
2.2.1.3 Conclusion
2.2.2 Experiment 2 - Impulse response during the vocal tract transitions
2.2.2.1 Methodology
2.2.2.2 Results
2.2.2.3 Discussion
2.2.2.4 Conclusion
2.2.3 Experiment 3 - Speech signal characterization from power spectrum and phase spectrum
2.2.3.1 Methodology
2.2.3.2 Observations and discussions
2.2.4 Experiment 4 - The role of amplitude spectrum in perceptive speech
2.2.4.1 Objective
2.2.4.2 Methodology
2.2.4.2.1 Stimuli
2.2.4.2.2 Perception test
2.2.4.3 Results and discussions
2.3 Conclusions of chapter 2

Chapter 3 Dynamic acoustic characteristics at the speech perception level
3.1 Introduction
3.2 Consonant perception in pseudo-V1CV2
3.2.1 General methodology
3.2.1.1 Type of experiments
3.2.1.2 Perceptual test process
3.2.2 Non-illusion experiment
3.2.2.1 Purpose
3.2.2.2 Stimuli
3.2.2.3 Results
3.2.3 Illusion experiment
3.2.3.1 Purpose
3.2.3.2 Stimuli
3.2.3.3 Results
3.3 Discussion
3.4 Conclusions of chapter 3

Chapter 4 Modeling dynamic acoustic speech features
4.1 The “pseudo-formant” parameters - Spectral Subband Centroid (SSC) features
4.1.1 Definition of SSCF features
4.1.2 Design of SSCF features
4.1.3 Comparison between SSCF features and formant frequencies
4.1.3.1 SSCF features have properties similar to formant frequencies
4.1.3.2 SSCF as continuous parameters on time domain, unlike formant frequencies
4.1.3.3 Isolated vocalic SSCF parameters and vocalic formant frequencies
4.2 Modeling acoustic dynamic speech features - SSCF Angles
4.2.1 Acoustic Vietnamese vowel gesture on SSCF parameter plane
4.2.1.1 Methodology
4.2.1.1.1 Stimuli
4.2.1.1.2 Implementation
4.2.1.2 Results
4.2.2 Modeling acoustic and dynamic speech features from SSCF parameters - SSCF Angles
4.3 Calculation of the acoustic and dynamic speech features using SSCF Angles
4.4 SSCF Angles analysis on Vietnamese Vowel-to-Vowel transitions
4.4.1 Methodology
4.4.1.1 Vietnamese stimuli
4.4.1.2 Analysis method
4.4.2 Results
4.4.2.1 Case 1: SSCF Angles comparisons among different transitions for each speaker
4.4.2.1.1 SSCF Angle12
4.4.2.1.2 SSCF Angle23
4.4.2.1.3 SSCF Angle34
4.4.2.2 Case 2: SSCF Angles comparisons with same items among males and females
4.4.2.2.1 /ai/ sequence
4.4.2.2.2 /au/ sequence
4.4.2.2.3 /iu/ sequence
4.4.2.2.4 Other Vietnamese V1V2 transition sequences
4.4.2.3 Vietnamese V1V2 transitions in 3-D plane of SSCF Angles
4.4.2.3.1 Group of /ai, aɛ, ae/ transitions in 3-D plane of SSCF Angles
4.4.2.3.2 Group of /ia, ɛa, ea/ transitions in 3-D plane of SSCF Angles
4.4.2.3.3 Group of /oa, ɔa, ua/ in 3-D plane of SSCF Angles
4.4.2.3.4 Group of /ao, aɔ, au/ in 3-D plane of SSCF Angles
4.4.3 Conclusions
4.5 SSCF Angles analysis on French Vowel-to-Vowel transitions
4.5.1 Methodology
4.5.1.1 French stimuli