
Fast Accurate Diphone-Based Phoneme Recognition (PDF)

179 pages · 2009 · 5.73 MB · English

Fast Accurate Diphone-Based Phoneme Recognition

Marianne du Preez

Thesis presented in partial fulfilment of the requirements for the degree Master of Science in Electronic Engineering at the University of Stellenbosch

Supervisors: Prof. J. A. du Preez and Dr. H. A. Engelbrecht

March 2009

Declaration

I, the undersigned, hereby declare that the work contained in this thesis is my own original work unless indicated otherwise, and that I have not previously in its entirety or in part submitted it at any university for a degree.

Signature                         Date

Copyright © 2009 Stellenbosch University. All rights reserved.

Abstract

Statistical speech recognition systems typically utilise a set of statistical models of subword units based on the set of phonemes in a target language. However, in continuous speech it is important to consider co-articulation effects and the interactions between neighbouring sounds, as over-generalisation of the phonetic models can negatively affect system accuracy. Traditionally, co-articulation in continuous speech is handled by incorporating contextual information into the subword model by means of context-dependent models, which exponentially increase the number of subword models. In contrast, transitional models aim to handle co-articulation by modelling the interphone dynamics found in the transitions between phonemes.

This research aimed to perform an objective analysis of diphones as subword units for use in hidden Markov model-based continuous-speech recognition systems, with special emphasis on a direct comparison to a context-dependent biphone-based system in terms of complexity, accuracy and computational efficiency under similar parametric conditions. To simulate practical conditions, the experiments were designed to evaluate these systems in a low-resource environment – a limited supply of training data, computing power and system memory – while still attempting fast, accurate phoneme recognition.
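The exponential growth in context-dependent inventories can be made concrete with a quick count. The sketch below assumes a hypothetical 40-phoneme set; the figures are upper bounds on model counts before any pruning or state tying is applied:

```python
# Upper-bound subword inventory sizes for a hypothetical 40-phoneme set.
N = 40
inventories = {
    "monophones": N,       # one model per phoneme, context-independent
    "biphones":   N * N,   # each phoneme in every one-sided context
    "diphones":   N * N,   # one model per phoneme-to-phoneme transition
    "triphones":  N ** 3,  # each phoneme in every left and right context
}

for unit, count in inventories.items():
    print(f"{unit:>10}: {count}")
```

Diphones and biphones grow at the same quadratic rate; the comparison in this thesis is therefore about what each unit models (transitions versus contexts) at a comparable inventory size, not about raw model counts.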
Adaptation techniques designed to exploit characteristics inherent in diphones, as well as techniques used for effective parameter estimation and state-level tying, were used to reduce resource requirements while simultaneously increasing parameter reliability. These techniques include diphthong splitting, utilisation of a basic diphone grammar, diphone set completion, maximum a posteriori estimation and decision-tree based state clustering algorithms. The experiments were designed to evaluate the contribution of each adaptation technique individually and subsequently to compare the optimised diphone-based recognition system to a biphone-based recognition system that received similar treatment.

Results showed that diphone-based recognition systems perform better than both traditional phoneme-based systems and context-dependent biphone-based systems when evaluated under similar parametric conditions. Diphones are therefore effective subword units, which carry suprasegmental knowledge of speech signals and provide an excellent compromise between detailed co-articulation modelling and acceptable system performance.

Opsomming (Afrikaans Summary)

Statistical speech recognition typically makes use of a set of statistical subword models based on the set of phonemes in a given language. In continuous speech, however, it is important to take co-articulation and the interaction of neighbouring sounds into account, since over-generalisation of the phonetic models can negatively affect system accuracy. Traditionally these co-articulation effects are handled by means of context-dependent modelling, in which phonemes in different contexts are modelled separately. This process causes an exponential growth in the number of subword models. In contrast, transitional models can be used to handle co-articulation by capturing the dynamics in the transitions between phonemes.
This research aimed to perform an objective analysis of diphones as subword models in hidden Markov model-based continuous-speech recognition systems. Special emphasis was placed on a direct comparison of the diphone system with a context-dependent biphone system in terms of complexity, accuracy and effective processing capability under parametrically similar conditions. To simulate practical conditions, all experiments were designed to evaluate the systems in an environment with few resources – a limited amount of training data, processing power and system memory – while still aiming for fast, accurate phoneme recognition.

Adaptation techniques were designed and used to optimise the diphone models by reducing resource requirements while simultaneously increasing parameter reliability. These techniques include the exploitation of diphone characteristics by means of diphthong splitting, the use of a basic diphone grammar, diphone set completion, maximum a posteriori estimation and state clustering by means of decision trees. The experiments were designed to analyse the contribution of each technique individually, after which the best system was compared with a biphone system that was treated in a similar manner.

Results indicate that diphone-based systems perform better than both traditional phoneme-based and context-dependent biphone-based systems under similar parametric conditions. Diphones are therefore effective subword units that carry suprasegmental information and offer an excellent compromise between detailed co-articulation modelling and acceptable system performance.

Acknowledgements

I would like to thank my study leaders, Prof. du Preez and Dr. Engelbrecht, for their extensive help, guidance and support, without which this thesis would have been impossible. Their expertise, suggestions and feedback were of immeasurable value. I want to express my gratitude to my family and friends for their encouragement and moral support.
Special thanks go to my mother for using her extensive wisdom and knowledge to help with the grammatical editing and proofreading of this document. She has provided assistance in numerous ways and helped shape and improve the end result.

Contents

1 Introduction
  1.1 Motivation
  1.2 Background
    1.2.1 Statistical Speech Recognition
    1.2.2 Elementary Linguistic Theory
    1.2.3 Acoustic Modelling for use in Speech Recognition
  1.3 Literature Synopsis
  1.4 Objectives
  1.5 Contributions
  1.6 Thesis Overview
    1.6.1 Background Theory on Statistical Speech Recognition
    1.6.2 Analyses of Diphones and their Use in Speech Recognition
    1.6.3 Experiments, Results and Conclusions

2 Speech Recognition: Theoretical Background
  2.1 Types of Speech Recognition
  2.2 Literature Study
    2.2.1 A Brief History
    2.2.2 The Use of Diphones in Speech Recognition
  2.3 The Speech Recognition System
    2.3.1 Mathematical Formulation
    2.3.2 Components of the Speech Recognition System
    2.3.3 Digital Signal Processing for Speech Signals
    2.3.4 Acoustic Modelling
    2.3.5 Lexical Modelling
    2.3.6 Language Modelling
  2.4 Summary

3 Hidden Markov Model Theory
  3.1 Definition of a Hidden Markov Model
    3.1.1 Markov Chains
    3.1.2 Hidden Markov Models
  3.2 Algorithms Used with Hidden Markov Models
    3.2.1 The Evaluation Problem
    3.2.2 The Decoding Problem
    3.2.3 The Learning Problem
  3.3 Hidden Markov Models Used in Speech Recognition
    3.3.1 HMM Topology
    3.3.2 State Output Probability Distributions
  3.4 Implementation of Acoustic Modelling for Phoneme Recognition
    3.4.1 Creating and Training the Model
    3.4.2 Alignment of the Labelled Training Set
    3.4.3 Decoding
    3.4.4 Evaluation
  3.5 Summary

4 Diphones as Base Units for Speech Recognition
  4.1 Speech Units Used in Linguistics
    4.1.1 Syllables
    4.1.2 Monophones
    4.1.3 Biphones
    4.1.4 Triphones
    4.1.5 Diphones
  4.2 Modelling Transitions versus Modelling Context Dependency
    4.2.1 Trainability
    4.2.2 Complexity and Resource Requirements
    4.2.3 Handling Inter-word Contexts
    4.2.4 Modelling of Unseen Contexts
  4.3 Implementation Strategies for Diphones
    4.3.1 Non-parametric Methods
    4.3.2 Parametric Methods
    4.3.3 Automatic Diphone Segmentation
  4.4 Acoustic Modelling with Diphones as Base Unit
    4.4.1 Segmentation
    4.4.2 Model Structure
    4.4.3 Decoding
    4.4.4 Evaluation
  4.5 Summary

5 Adaptation Techniques for Diphone Models
  5.1 Diphthong Splitting
  5.2 Basic Diphone Grammar for Phoneme Spotting
  5.3 Diphone Set Completion
    5.3.1 Building Diphone Models from Well-trained Monophone Models
    5.3.2 Bootstrapping the Diphone Set with Monophone Models
  5.4 Maximum A Posteriori Estimation
    5.4.1 Mathematical Formulation
    5.4.2 MAP Estimation of Gaussian Mean Values
    5.4.3 MAP Estimation as Used in this Thesis
  5.5 Decision Tree Based State Clustering
    5.5.1 Overview of Decision Tree Logic
    5.5.2 Classification and Regression Trees (CART)
    5.5.3 Creating a CART
    5.5.4 Classification and Regression Trees as Used in this Thesis
  5.6 Summary

6 Experimental Investigation
  6.1 Experimental Setup
    6.1.1 Hardware Platform
    6.1.2 Software Platform
    6.1.3 The AST Data Set
    6.1.4 Signal Processing
    6.1.5 Statistical Modelling Parameters
    6.1.6 System Evaluation
    6.1.7 Statistical Significance Tests
  6.2 Monophone-based Continuous Phoneme Recognition
    6.2.1 Motivation
    6.2.2 Experimental Setup
    6.2.3 Results
    6.2.4 Interpretation
  6.3 Diphone-based Continuous Phoneme Recognition
    6.3.1 First Approximations
    6.3.2 Diphthong Splitting
    6.3.3 MAP Estimation
    6.3.4 Decision-tree Based State Clustering
    6.3.5 Interpretation of Diphone Results
  6.4 Biphone-based Continuous Phoneme Recognition
    6.4.1 Motivation
    6.4.2 Experimental Setup
    6.4.3 Results
    6.4.4 Interpretation
  6.5 Comparison of Systems in Limited-Resource Environments
  6.6 Summary

7 Conclusions
  7.1 Concluding Perspective
  7.2 Context Within Existing Research
  7.3 Future Work
    7.3.1 Diphone-based System
    7.3.2 Comparison with a Triphone-based System

A Selected Topics from Linguistic Theory
  A.1 International Phonetic Alphabet
  A.2 Types of Phonemes
  A.3 Additional Terms Related to the Production of Speech Sounds

B Speech Corpus
  B.1 The African Speech Technology (AST) Speech Corpus
    B.1.1 Data Sets
    B.1.2 Collection Parameters
    B.1.3 Phoneme Set
  B.2 Subword Unit Statistics
    B.2.1 Monophones

C CART
  C.1 Question Set
  C.2 Pruning
  C.3 Minimum Description Length Based Induction and Pruning

List of Figures

1.1 Time waveform of the word "test"
1.2 Enlargement of the steady-state region of the vowel /E/
1.3 Enlargement of the steady-state region of the consonant /s/
1.4 General discrete-time filter model for speech production
2.1 System diagram of the basic components in a speech recognition system
2.2 Relationship between the Mel frequency scale and the Hertz frequency scale
3.1 3-state left-to-right hidden Markov model
3.2 The forward algorithm. The partial probability α_{t+1}(j) is recursively defined by multiplying the output probability of state s_j with the sum of the partial probabilities of the states leading to state s_j, multiplied by their respective transition probabilities.
3.3 3-state fully connected hidden Markov model
3.4 5-state left-to-right hidden Markov model with non-emitting starting and terminating states
3.5 A two-dimensional Gaussian distribution
3.6 Parallel-HMM model used for the decoding of phoneme sequences in continuous phoneme recognition
4.1 Acoustic waveform, spectrogram and corresponding subdivision into different subword units for the utterance "Lend me your ears". The subword units from top to bottom are phonemes, left-context biphones, right-context biphones and diphones.
5.1 Basic diphone grammar for the language defined by monophones "SIL", "a", "b" and "c". The spotter assumes the existence of silence at the beginning and end of each utterance. The black null states are called cluster states and represent the set of diphones with a common first monophone.
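The recursion described in the caption of Figure 3.2 can be sketched in a few lines for a discrete-observation HMM. This is a minimal illustration under assumed names (A, B, pi), not the thesis's implementation:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: alpha[t, j] is the probability of observing
    o_1..o_t and ending in state j at time t.
    A:   (S, S) transition matrix, A[i, j] = P(s_j at t+1 | s_i at t)
    B:   (S, V) output matrix,     B[j, v] = P(observation v | s_j)
    pi:  (S,)   initial state distribution
    obs: sequence of observation indices
    Returns P(O | model), the solution to the evaluation problem.
    """
    T = len(obs)
    S = A.shape[0]
    alpha = np.zeros((T, S))
    alpha[0] = pi * B[:, obs[0]]  # initialisation at t = 1
    for t in range(T - 1):
        # alpha_{t+1}(j) = b_j(o_{t+1}) * sum_i alpha_t(i) * a_{ij}
        alpha[t + 1] = B[:, obs[t + 1]] * (alpha[t] @ A)
    return alpha[-1].sum()  # sum over final states
```

Summing the final-column partial probabilities over all states gives the total likelihood P(O | model), which is exactly what the evaluation problem of Section 3.2.1 asks for; in practice the recursion is computed in the log domain to avoid underflow on long utterances.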

