Belief Hidden Markov Model for Speech Recognition

Siwar Jendoubi (University of Tunis, ISG Tunis, LARODEC Laboratory, [email protected]), Boutheina Ben Yaghlane (University of Carthage, IHEC Carthage, LARODEC Laboratory, [email protected]), Arnaud Martin (University of Rennes I, IUT de Lannion, UMR 6074 IRISA, [email protected])

Abstract—Speech recognition aims to predict the spoken words automatically. These systems are known to be very expensive because they rely on several pre-recorded hours of speech. Hence, building a model that minimizes the cost of the recognizer is of great interest. In this paper, we present a new approach for recognizing speech based on belief HMMs instead of probabilistic HMMs. Experiments show that our belief recognizer is insensitive to the lack of data: it can be trained using only one exemplar of each acoustic unit and still gives good recognition rates. Consequently, using the belief HMM recognizer can greatly reduce the cost of these systems.

Index Terms—Speech recognition, HMM, Belief functions, Belief HMM.

I. INTRODUCTION

Automatic speech recognition is a domain of science that attracts the attention of the public. Indeed, who has never dreamed of talking with a machine, or at least of controlling an appliance or a computer by voice? Speech processing includes two major disciplines: speech recognition and speech synthesis. Automatic speech recognition allows the machine to understand and process oral information provided by a human. It uses matching techniques to compare a sound wave to a set of samples, generally composed of words or sub-words. Conversely, automatic speech synthesis allows the machine to reproduce the speech sounds of a given text. Nowadays, most speech recognition systems are based on the modelling of speech units known as acoustic units. Indeed, speech is composed of a sequence of elementary sounds, and these sounds put together make up words. From these units we seek to derive a model (one model per unit) which will be used to recognize a continuous speech signal. Hidden Markov Models (HMM) are very often used to model these units. The HMM-based recognizer is a widely used technique that recognizes about 80% of a given speech signal, but this recognition rate is still not satisfying. Moreover, this method needs many hours of speech for training, which makes the automatic speech recognition task very expensive.

Recently, [7], [6] extended the Hidden Markov Model to the theory of belief functions. The belief HMM avoids disadvantages of the probabilistic HMM which are, generally, due to the use of probability theory. Belief functions are used in several domains of research where uncertainty and imprecision dominate. They provide many tools for managing and processing the available pieces of evidence in order to extract knowledge and make better decisions. They allow experts to have a clearer vision of their problems, which is helpful for finding better solutions. What is more, belief function theories offer more flexible ways to model uncertain and imprecise data than probability functions. Finally, they offer many tools with a higher ability to combine a great number of pieces of evidence.

The belief HMM gives a better classification rate than the ordinary HMM when both are applied to a classification problem. Consequently, we propose to use the belief HMM in the speech recognition process. We note that this is, to our knowledge, the first time belief functions are used in speech processing.

In the next section we introduce the probabilistic hidden Markov model and define its three famous problems. In Section III we present the probabilistic HMM recognizer, the acoustic model and the recognition process. The transferable belief model is introduced in Section IV. In Section V we discuss the belief HMM. In Section VI, we present our belief HMM recognizer, the belief acoustic model and the belief recognition process. Finally, experiments are presented in Section VII.
II. PROBABILISTIC HMM

A Hidden Markov Model is a combination of two stochastic processes. The first one is a Markov chain characterized by a finite set^1 $\Omega_t$ of $N$ non-observable (hidden) states and the transition probabilities between them, $a_{ij} = P(s_j^{t+1} \mid s_i^t)$, $1 \le i,j \le N$. The second stochastic process produces the sequence of $T$ observations; it depends on the probability density function of the observation model, defined as $b_j(O_t) = P(O_t \mid s_j^t)$, $1 \le j \le N$, $1 \le t \le T$ [4]. In this paper we use a mixture of Gaussian densities. The initial state distribution is defined as $\pi_i = P(s_i^1)$, $1 \le i \le N$. Hence, an HMM $\lambda(A, B, \Pi)$ is characterized by the transition matrix $A = \{a_{ij}\}$, the observation model $B = \{b_j(O_t)\}$ and the initial state distribution $\Pi = \{\pi_i\}$.

^1 $t$ denotes the current instant; it is put in the exponent of states for simplicity.

There exist three basic problems of HMMs that must be solved in order to use these models in real-world applications. The first problem is named the evaluation problem: it computes the probability $P(O \mid \lambda)$ that the observation sequence $O$ was generated by the model $\lambda$. This probability can be obtained using the forward propagation [4], which recursively estimates the forward variable:

$\alpha_t(i) = P(O_1 O_2 \cdots O_t,\; q_t = s_i \mid \lambda)$   (1)

$\alpha_t(j) = \left( \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right) b_j(O_t)$   (2)

for all states and all time instants. Then $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$ is obtained by summing the terminal forward variables. The backward propagation can also be used to solve this problem. Unlike the forward pass, the backward propagation goes backward in time. At each instant, it calculates the backward variable:

$\beta_t(i) = P(O_{t+1} O_{t+2} \cdots O_T \mid q_t = s_i,\; \lambda)$   (3)

$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$   (4)

Finally, $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)$ is obtained by combining the forward and backward variables.

The second problem is named the decoding problem: it predicts the state sequence $S$ that generated $O$. The Viterbi algorithm [4] solves this problem. Starting from the first instant $t = 1$, for each moment $t$ it calculates $\delta_t(j)$ for every state $j$ and keeps the state with the maximum $\delta_t(j) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1, q_2, \ldots, q_{t-1}, q_t = s_j, O_1 O_2 \cdots O_t \mid \lambda) = \max_{1 \le i \le N} \left( \delta_{t-1}(i)\, a_{ij} \right) b_j(O_t)$. When the algorithm reaches the last instant $t = T$, it keeps the state that maximizes $\delta_T$. Finally, the Viterbi algorithm back-tracks the sequence of states as the pointer kept at each moment $t$ indicates.

The last problem is the learning problem: it adjusts the model parameters in order to maximize $P(O \mid \lambda)$. The Baum-Welch method [4] is widely used; this algorithm uses the forward and backward variables to re-estimate the model parameters.
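For concreteness, the following is a minimal NumPy sketch of the forward recursion (1)-(2) and of the Viterbi decoder described above. It assumes the observation likelihoods $b_j(O_t)$ have already been evaluated into a $T \times N$ matrix B (the paper uses Gaussian mixture densities for that step); the sketch is illustrative, not the implementation used in the experiments.

```python
# Forward recursion and Viterbi decoding for an HMM lambda(A, B, Pi).
# A: N x N transition matrix, pi: length-N initial distribution,
# B: T x N matrix of pre-evaluated likelihoods b_j(O_t) (an assumption
# made here for brevity).
import numpy as np

def forward(A, B, pi):
    """Return P(O | lambda) and the T x N matrix of forward variables alpha."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[0]                      # initialisation: alpha_1(i) = pi_i b_i(O_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]  # Eq. (2): sum_i alpha_{t-1}(i) a_ij, times b_j(O_t)
    return alpha[-1].sum(), alpha             # P(O | lambda) = sum_i alpha_T(i)

def viterbi(A, B, pi):
    """Return the most likely state sequence by maximising delta_t(j)."""
    T, N = B.shape
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)         # back-pointers kept at each instant
    delta[0] = pi * B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A    # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[t]
    path = [int(delta[-1].argmax())]          # best terminal state
    for t in range(T - 1, 0, -1):             # back-track as the pointers indicate
        path.append(psi[t][path[-1]])
    return path[::-1]
```

In practice these recursions are usually run in log space, turning the products into sums to avoid numerical underflow on long observation sequences.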
III. PROBABILISTIC HMM BASED RECOGNIZER

A. Acoustic model

The acoustic model attempts to mimic the human auditory system. It is the model used by the HMM-based speech recognizer to transform the speech signal into a sequence of acoustic units; this sequence is then transformed into a phoneme sequence, and finally the desired text is generated by converting the phoneme sequence into text. Acoustic models are used by speech segmentation and speech recognition systems.

The acoustic model is composed of a set of HMMs [4], where each HMM corresponds to an acoustic unit. To obtain a good acoustic model, some choices have to be made:

a) The acoustic unit: the choice of the acoustic unit is very important; in fact, the number of units influences the complexity of the model (the larger the number, the more complex the model). If we choose a small unit like the phone, we will have an HMM for every possible phone in the language; the problem with this choice is that the phone does not model its context. Such a model is called a context-independent model. These models are generally used for speech segmentation systems. Other units that take the context into account can also serve as acoustic units, such as the diphone, which models the transition between two phones, the triphone, which models the transition between three phones, subwords and words. These models are called context-dependent models. According to [5], the larger the context, the better the recognition performance.

b) The model: with each acoustic unit we associate an HMM; the type of HMM and the probability density function of the observations must then be chosen. Generally, left-right models are used for speech recognition and speech synthesis systems [4]. In fact, the speech signal has the property that it changes over time, so the choice of the left-right model is justified by the fact that there are no back transitions: all transitions go forward. The number of states is fixed in advance or chosen experimentally. [2], [3] fixed the number of states to three; this choice is justified by the fact that most phoneme acoustic realizations are characterized by three sub-segments, hence one state per sub-segment. [1], [12] used an HMM of six states. Finally, we choose the probability density function of the observations. It is represented by a mixture of Gaussian pdfs, and the number of mixture components is generally chosen experimentally.

The next step consists in training the parameters of each HMM using a speech corpus that contains many exemplars of each acoustic unit. Speech segments are transformed into sequences of acoustic vectors by means of a feature extraction method like MFCC; these acoustic vectors are our sequences of observations.

Then, the HMMs are concatenated to each other and we obtain the model that will be used to recognize a new speech signal. The recognizer contains three levels. The first one is the syntactic level; it represents all possible word sequences that can be recognized by our model. The second level is the lexical level; it represents the phonetic transcription (the phoneme sequence) of each word. Finally, the third level is the acoustic level; it models the realization of each acoustic unit (in this case the phone).

B. Speech recognition process

The model described above is used for the speech recognition process. Let S be the speech signal to be recognized. Recognizing S consists in finding the most likely path in the syntactic network. The first step is to transform S into a sequence of acoustic vectors using the same feature extraction method used for training; we thus obtain our sequence of observations O. The most likely path is the path that maximizes the probability of observing O given the model, $P(O \mid \lambda)$. This probability can be computed either with the forward algorithm or with the Viterbi algorithm.
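The recognition step can be summarized by the hedged sketch below, simplified to isolated acoustic units (a full recognizer searches the syntactic network rather than scoring units independently). Here extract_mfcc and gmm_likelihoods are hypothetical helpers standing in for the MFCC front end and the Gaussian mixture observation model, and forward() is the function from the previous sketch.

```python
# Isolated-unit recognition: score the observation sequence against the HMM
# of every acoustic unit and keep the best. 'models' maps a unit name to its
# trained parameters (A, pi, gmms); all helper interfaces are assumptions.
import numpy as np

def recognize(signal, models, extract_mfcc, gmm_likelihoods):
    """Return the acoustic unit whose HMM maximises P(O | lambda)."""
    O = extract_mfcc(signal)                 # same front end as used for training
    best_unit, best_score = None, -np.inf
    for unit, (A, pi, gmms) in models.items():
        B = gmm_likelihoods(gmms, O)         # T x N matrix of b_j(O_t)
        score, _ = forward(A, B, pi)         # evaluation problem, Eqs. (1)-(2)
        if score > best_score:
            best_unit, best_score = unit, score
    return best_unit
```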
IV. TRANSFERABLE BELIEF MODEL

The Transferable Belief Model (TBM) [11], [10] is a widely used variant of the theories of belief functions. It is a more general system than the Bayesian model.

Let $\Omega_t = \{\omega_1, \omega_2, \ldots, \omega_n\}$ be our frame of discernment. The agent's belief on $\Omega_t$ is represented by the basic belief assignment (BBA) $m^{\Omega_t}$ defined from $2^{\Omega_t}$ to $[0,1]$. $m^{\Omega_t}(A)$ is the mass value assigned to the proposition $A \subseteq \Omega_t$ and it must respect $\sum_{A \subseteq \Omega_t} m^{\Omega_t}(A) = 1$. We can also define a conditional BBA $m^{\Omega_t}[S^{t-1}](A)$, i.e. a BBA defined conditionally to $S^{t-1} \subseteq \Omega_{t-1}$. If $m^{\Omega_t}(\emptyset) > 0$, the BBA can be normalized by dividing the other masses by $1 - m^{\Omega_t}(\emptyset)$; the conflict mass is then redistributed and $m^{\Omega_t}(\emptyset) = 0$.

A basic belief assignment can be converted into other functions. They represent the same information under other forms; what is more, they are in one-to-one correspondence, and they are all defined from $2^{\Omega}$ to $[0,1]$. We will use the belief ($bel$), plausibility ($pl$) and commonality ($q$) functions:

$bel^{\Omega}(A) = \sum_{\emptyset \ne B \subseteq A} m^{\Omega}(B), \quad \forall A \subseteq \Omega,\; A \ne \emptyset$   (5)

$m^{\Omega}(A) = \sum_{B \subseteq A} (-1)^{|A|-|B|}\, bel^{\Omega}(B), \quad \forall A \subseteq \Omega$   (6)

$pl^{\Omega}(A) = \sum_{B \cap A \ne \emptyset} m^{\Omega}(B), \quad \forall A \subseteq \Omega$   (7)

$m^{\Omega}(A) = \sum_{B \subseteq A} (-1)^{|A|-|B|-1}\, pl^{\Omega}(\bar{B}), \quad \forall A \subseteq \Omega$   (8)

$q^{\Omega}(A) = \sum_{B \supseteq A} m^{\Omega}(B), \quad \forall A \subseteq \Omega$   (9)

$m^{\Omega}(A) = \sum_{B \supseteq A} (-1)^{|B|-|A|}\, q^{\Omega}(B), \quad \forall A \subseteq \Omega$   (10)

Consider two distinct BBAs $m_1^{\Omega}$ and $m_2^{\Omega}$ defined on $\Omega$. We can obtain $m_{1 \cap 2}^{\Omega}$ through the TBM conjunctive rule (also called the conjunctive rule of combination, CRC) [9] as:

$m_{1 \cap 2}^{\Omega}(A) = \sum_{B \cap C = A} m_1^{\Omega}(B)\, m_2^{\Omega}(C), \quad \forall A \subseteq \Omega$   (11)

Equivalently, we can calculate the CRC via a simpler expression defined with the commonality function:

$q_{1 \cap 2}^{\Omega}(A) = q_1^{\Omega}(A)\, q_2^{\Omega}(A), \quad \forall A \subseteq \Omega$   (12)
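The following short sketch illustrates equations (9)-(12), storing a BBA as a vector of length $2^{|\Omega|}$ indexed by subset bitmasks. This representation is an implementation convention of ours, not something prescribed by the TBM.

```python
# BBA conversions and conjunctive combination. A subset A of a frame of n
# elements is encoded as an integer bitmask in [0, 2**n); index 0 is the
# empty set. (B & A) == A tests that A is a subset of B.
import numpy as np

def mass_to_commonality(m):
    """Eq. (9): q(A) = sum of m(B) over all supersets B of A."""
    n = len(m)
    return np.array([sum(m[B] for B in range(n) if (B & A) == A)
                     for A in range(n)])

def commonality_to_mass(q):
    """Eq. (10): Moebius inversion, m(A) = sum_{B >= A} (-1)^(|B|-|A|) q(B)."""
    n = len(q)
    m = np.zeros(n)
    for A in range(n):
        for B in range(n):
            if (B & A) == A:
                m[A] += (-1) ** (bin(B).count("1") - bin(A).count("1")) * q[B]
    return m

def conjunctive_combination(m1, m2):
    """Eqs. (11)-(12): the CRC computed through the commonality product."""
    return commonality_to_mass(mass_to_commonality(m1) * mass_to_commonality(m2))
```

As a usage example, on $\Omega = \{\omega_1, \omega_2\}$ (masks: 0 = $\emptyset$, 1 = $\{\omega_1\}$, 2 = $\{\omega_2\}$, 3 = $\Omega$), combining m1 = [0, 0.6, 0, 0.4] with m2 = [0, 0, 0.5, 0.5] yields m = [0.3, 0.3, 0.2, 0.2]: the mass 0.3 on the empty set is exactly the conflict between the two sources, the quantity the belief HMM exploits below.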
V. BELIEF HMM

The belief HMM is an extension of the probabilistic HMM to belief functions [7], [6], [8]. Like the probabilistic HMM, the belief HMM is a combination of two stochastic processes. Hence, a belief HMM is characterized by:

• The credal transition matrix $A = \{m_a^{\Omega_t}[S_i^{t-1}](S_j^t)\}$, a set of BBAs defined conditionally to all possible subsets of states $S_i^{t-1}$;
• The observation model $B = \{m_b^{\Omega_t}[O_t](S_j^t)\}$, a set of BBAs defined conditionally to the set of possible observations $O_t$;
• The initial state distribution $\Pi = \{m_\pi^{\Omega_1}(S_i^{\Omega_1})\}$.

The three basic problems of the HMM and their solutions are extended to belief functions. As we know, the forward algorithm solves the evaluation problem in the probabilistic case; [7] introduced the credal forward algorithm to solve this problem in the evidential case. It takes as inputs $m_a^{\Omega_t}[S_i^{t-1}](S_j^t)$ and $m_b^{\Omega_t}[O_t](S_j^t)$ to calculate the forward commonality:

$q_\alpha^{\Omega_{t+1}}(S_j^{t+1}) = \left( \sum_{S_i^t \subseteq \Omega_t} m_\alpha^{\Omega_t}(S_i^t)\; q_a^{\Omega_{t+1}}[S_i^t](S_j^{t+1}) \right) q_b^{\Omega_{t+1}}[O_t](S_j^{t+1})$   (13)

where the conjunctive combination with the observation BBA reduces to a product of commonalities, as in (12). This quantity is calculated recursively from $t = 1$ to $T$. [6] exploits the conflict of the forward BBA (obtained by using formula (10)) to define an evaluation metric that can be used for classification, i.e. to choose the model that best fits the observation sequence, or to evaluate the model. Given a model $\lambda$ and an observation sequence of length $T$, the conflict metric is defined by:

$L_c(\lambda) = \frac{1}{T} \sum_{t=1}^{T} \log\left(1 - m_\alpha^{\Omega_{t+1}}[\lambda](\emptyset)\right)$   (14)

$\lambda^* = \arg\max_\lambda L_c(\lambda)$   (15)

A credal backward algorithm is also defined; it recursively calculates the backward commonality from $T$ down to $t = 1$. More details can be found in [7], [6].

For the decoding problem, many solutions have been proposed to extend the Viterbi algorithm to the TBM [7], [6], [8]. All of them seek to maximize the state sequence plausibility. According to the definition given in [8], the plausibility of a sequence of singleton states $S = \{s^1, s^2, \ldots, s^T\}$, $s^t \in \Omega_t$, is given by:

$pl_\delta(S) = pl_\pi(s^1) \prod_{t=2}^{T} pl_a^{\Omega_t}[s^{t-1}](s^t) \prod_{t=1}^{T} pl_b(s^t)$   (16)

Hence, we can choose the best state sequence by maximizing this plausibility.

For the learning problem, [6], [8] have proposed some solutions to estimate the model parameters; we describe the method used in this paper. The first step consists in estimating the parameters of the Gaussian mixture models (GMM) using the Expectation-Maximization (EM) algorithm. For each state we estimate one GMM. These models are used to calculate $m_b^{\Omega_t}[O_t](S_j^t)$. [6] proposes to estimate the credal transition matrix independently from the transitions themselves. He uses the observation BBAs as:

$m^{\Omega_t \times \Omega_{t+1}} \propto \frac{1}{T-1} \sum_{t=1}^{T-1} \left( m_b^{\Omega_t}[O_t]^{\uparrow \Omega_t \times \Omega_{t+1}} \cap\, m_b^{\Omega_{t+1}}[O_{t+1}]^{\uparrow \Omega_t \times \Omega_{t+1}} \right)$   (17)

where $\cap$ denotes the CRC (11), and $m_b^{\Omega_t}[O_t]^{\uparrow \Omega_t \times \Omega_{t+1}}$ and $m_b^{\Omega_{t+1}}[O_{t+1}]^{\uparrow \Omega_t \times \Omega_{t+1}}$ are computed using the vacuous extension operator [9] of the BBA $m_b^{\Omega_t}$ on the Cartesian product space as:

$m_b^{\Omega_t \uparrow \Omega_t \times \Omega_{t+1}}(A) = \begin{cases} m_b^{\Omega_t}(B) & \text{if } A = B \times \Omega_{t+1}, \\ 0 & \text{otherwise.} \end{cases}$   (18)

This estimation formula is used by [8] as an initialization for the ITS (Iterative Transition Specialization) algorithm. ITS is an iterative algorithm that uses the credal forward algorithm to improve the estimation of the credal transition matrix; it stops when the conflict metric (formula (14)) has converged.
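To make the recursion concrete, here is a speculative sketch of the credal forward pass (13) and of the conflict metric (14), reusing the bitmask BBA representation and commonality_to_mass() from the previous sketch. Renormalizing the forward BBA after reading off its conflict is our own simplification and may differ from the exact scheme of [7], [6].

```python
# Credal forward recursion with conflict collection. q_a is an array of shape
# (2**N, 2**N): q_a[Si] is the commonality of the conditional transition BBA
# m_a[S_i]; q_b is a list of per-frame observation commonalities. Both are
# assumed inputs, built elsewhere from the GMMs and Eq. (17).
import numpy as np

def credal_forward_conflict(m_pi, q_a, q_b):
    """Run Eq. (13) for t = 1..T and return the conflict metric of Eq. (14)."""
    m_alpha, log_terms = m_pi.copy(), []
    for q_b_t in q_b:                          # one observation commonality per frame
        q_next = np.zeros(len(m_alpha))
        for Si in range(len(m_alpha)):         # prediction: sum_Si m_alpha(Si) q_a[Si](.)
            q_next += m_alpha[Si] * q_a[Si]
        q_next *= q_b_t                        # CRC with the observation BBA, Eq. (12)
        m_next = commonality_to_mass(q_next)   # back to masses, Eq. (10)
        log_terms.append(np.log(1.0 - m_next[0]))  # index 0: mass on the empty set
        m_next[0] = 0.0                        # simplistic conflict redistribution
        m_alpha = m_next / m_next.sum()        # before iterating (our assumption)
    return float(np.mean(log_terms))           # Eq. (14); larger means a better fit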
VI. BELIEF HMM BASED RECOGNIZER

Our goal is to create a speech recognizer using the belief HMM instead of the probabilistic HMM. The HMM recognizer uses an acoustic model to recognize the content of the speech signal, so we seek to mimic this model in order to create a belief HMM based one. We should note that the existing parameter estimation methods proposed for the belief HMM cannot estimate model parameters from multiple observation sequences. This fact has to be taken into account when we design our belief acoustic model.

A. Belief acoustic model

In the probabilistic case, we use one HMM for each acoustic unit, and its parameters are trained using multiple speech realizations of the unit [5], [1], [2], [12], [3]. In the credal case, a similar model cannot be used. Hence, we present an alternative method that takes this fact into account.

Let K be the number of speech realizations of a given acoustic unit. These speech realizations are transformed into MFCC feature vectors; hence, we obtain K observation sequences. Our training set will be $O = [O^1, O^2, \ldots, O^K]$ where $O^k = (O_1^k, O_2^k, \ldots, O_{T_k}^k)$ is the $k$th observation sequence, of length $T_k$. These observations are supposed to be independent of each other. So instead of training one model on the whole observation set $O$, we propose to create a belief model for each observation sequence $O^k$. These K models will together represent the given acoustic unit in the recognition process.

As for the acoustic model based on the probabilistic HMM, we have to make some choices in order to obtain a good belief acoustic model. In the first place, we choose the acoustic unit; the same choices as in the probabilistic case can be adopted for the belief case. In the second place, we choose the model. We should note that we cannot choose the topology of the belief HMM; this is due to the estimation process of the credal transition matrix. In other words, the credal observation model is used to estimate the credal transition matrix, which does not leave us free to choose the topology of the resulting model. Consequently, choosing the model in the credal case consists in choosing the number of states and the number of Gaussian mixtures. In our case we fix the number of states to three and we choose the number of Gaussian mixtures experimentally.

B. Speech recognition process

The belief acoustic model is used in the speech recognition process. We now explain how the resulting model is used for recognizing a speech signal. Let S be the speech signal to be recognized. Recognizing S consists in finding the most likely set of models. The first step is to transform S into a sequence of acoustic vectors using the same feature extraction method used for training; we thus obtain our sequence of observations O. This sequence is used as input for all models, and the credal forward algorithm is then applied: each model gives as output the value of the conflict metric. An acoustic unit is represented by a set of models, and every model gives a value for the conflict metric; we then calculate the arithmetic mean of the resulting values. Finally, we choose the set of models that optimizes the average of the conflict metric, instead of optimizing the conflict metric itself with formula (15) as proposed by [6].
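Under this set-of-models representation, the decision rule can be sketched as follows. Here belief_models and score_model (for instance the credal_forward_conflict sketch of Section V) are assumed interfaces, not code from the paper.

```python
# Belief recognition: every acoustic unit is represented by its K belief HMMs;
# each model is scored on O with the conflict metric L_c, and the unit with the
# best average score wins.
import numpy as np

def recognize_belief(O, belief_models, score_model):
    """Return the acoustic unit maximising the mean conflict metric over its models."""
    best_unit, best_avg = None, -np.inf
    for unit, models in belief_models.items():   # unit -> list of K belief HMMs
        avg = np.mean([score_model(model, O) for model in models])
        if avg > best_avg:
            best_unit, best_avg = unit, avg
    return best_unit
```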
VII. EXPERIMENTS

In this section we present experiments that validate our approach. We compare our belief HMM recognizer to a similar one implemented using the probabilistic HMM.

We use MFCC (Mel Frequency Cepstral Coefficients) as feature vectors. Also, we use a three-state HMM and two Gaussian mixtures. Finally, to evaluate our models we calculate the percentage of correctly recognized acoustic units (number of correctly recognized acoustic units / total number of acoustic units). We use a speech corpus that contains speech realizations of seven different acoustic units, with fifteen exemplars of each one. Results are shown in Figure 1.

Figure 1. Influence of the number of observations on the recognition rate.

The lack of data for training the probabilistic HMM leads to very poor learning, and the resulting acoustic model cannot be efficient. Thus, using a training set that contains only one exemplar of each acoustic unit leads to a bad probabilistic recognizer. In this case our belief HMM based recognizer gives a recognition rate equal to 85.71%, against 13.79% for the probabilistic HMM trained using HTK [13]. This result shows that the belief HMM recognizer is insensitive to the lack of data, and that we can obtain a good belief acoustic model using only one observation for each unit. In fact, the belief HMM models knowledge by taking into account doubt, imprecision and conflict, which leads to a discriminative model in the case of a lack of data.

HTK is a toolkit for HMMs, optimized for the HMM speech recognition process. It is known to be powerful under the condition of having many exemplars of each acoustic unit; hence, it needs several hours of speech for training. Building a good speech corpus is very expensive, which drives up the cost of the recognition system: speech recognition systems are therefore very expensive. Consequently, using the belief HMM recognizer can greatly reduce the cost of these systems.

VIII. CONCLUSION

In this paper, we proposed the belief HMM recognizer. We showed that incorporating belief functions theory into the speech recognition process is very beneficial; in fact, it considerably reduces the cost of the speech recognition system. Future work will focus on the case of noisy speech signals. Indeed, existing speech recognizers are still not good when the signal to be decoded is noisy.

REFERENCES

[1] F. Brugnara, D. Falavigna, and M. Omologo. Automatic segmentation and labeling of speech based on hidden Markov models. Speech Communication, 12:370–375, 1993.
[2] P. Carvalho, L. C. Oliveira, I. M. Trancoso, and M. C. Viana. Concatenative speech synthesis for European Portuguese. In 3rd ESCA/COCOSDA Workshop on Speech Synthesis, pages 159–163, 1998.
[3] S. Cox, R. Brady, and P. Jackson. Techniques for accurate automatic annotation of speech waveforms. In Proc. ICASSP, 1998.
[4] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77:257–286, 1989.
[5] L. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1993.
[6] E. Ramasso. Contribution of belief functions to HMM with an application to fault diagnosis. IEEE International Workshop on Machine Learning and Signal Processing, Grenoble, France, September 2–4, 2009.
[7] E. Ramasso, M. Rombaut, and D. Pellerin. Forward-backward-Viterbi procedures in the transferable belief model for state sequence analysis using belief functions. ECSQARU, Hammamet, Tunisia, pages 405–417, 2007.
[8] L. Serir, E. Ramasso, and N. Zerhouni. Time-sliced temporal evidential networks: the case of evidential HMM with application to dynamical system analysis. IEEE International Conference on Prognostics and Health Management, Denver, Colorado, USA, 2011.
[9] P. Smets. Belief functions: The disjunctive rule of combination and the generalized Bayesian theorem. IJAR, 9:1–35, 1993.
[10] P. Smets. Belief functions and the transferable belief model. Available on www.sipta.org/documentation/belief/belief.ps, 2000.
[11] P. Smets and R. Kennes. The transferable belief model. Artificial Intelligence, 66(2):191–234, 1994.
[12] D. T. Toledano, L. A. H. Gomez, and L. V. Grande. Automatic phonetic segmentation. IEEE Trans. Speech, Audio Processing, 11(6):617–625, 2003.
[13] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, V. Valtchev, and P. Woodland. The HTK Book for HTK Version 3.4. Microsoft Corporation and Cambridge University Engineering Department, 2006.