Gernot A. Fink

Markov Models for Pattern Recognition
From Theory to Applications

With 51 Figures

Gernot A. Fink
Department of Computer Science
University of Dortmund
Otto-Hahn-Str. 16
44221 Dortmund
Germany
[email protected]

Originally published in the German language by B. G. Teubner Verlag as "Gernot A. Fink: Mustererkennung mit Markov-Modellen", © B. G. Teubner Verlag | GWV Fachverlage GmbH, Wiesbaden 2003.

ISBN 978-3-540-71766-9
© Springer-Verlag Berlin Heidelberg 2008


For my parents


Preface

The development of pattern recognition methods based on so-called Markov models is tightly coupled to the technological progress in the field of automatic speech recognition. Today, however, Markov chain and hidden Markov models are also applied in many other fields where the task is the modeling and analysis of chronologically organized data, for example genetic sequences or handwritten texts. Nevertheless, in monographs, Markov models are almost exclusively treated in the context of automatic speech recognition and not as a general, widely applicable tool of statistical pattern recognition.

In contrast, this book puts the formalism of Markov chain and hidden Markov models at the center of its considerations. Taking the three main application areas of this technology as examples, namely automatic speech recognition, handwriting recognition, and the analysis of genetic sequences, it demonstrates which adjustments to the respective application area are necessary and how these are realized in current pattern recognition systems. Besides the treatment of the theoretical foundations of the modeling, the book puts special emphasis on the presentation of algorithmic solutions, which are indispensable for the successful practical application of Markov-model technology. It therefore addresses researchers and practitioners from the field of pattern recognition as well as graduate students with an appropriate major field of study who want to devote themselves to speech or handwriting recognition, bioinformatics, or related problems and who want to gain a deeper understanding of the application of statistical methods in these areas.
The origins of this book lie in the author's extensive research and development in the field of statistical pattern recognition, which initially led to a German book published by Teubner, Wiesbaden, in 2003. The present edition is basically a translation of the German version, with several updates and modifications addressing an international audience. This book would not have been possible without the encouragement and support of my colleague Thomas Plötz, University of Dortmund, Germany, whom I would like to cordially thank for his efforts.

Dortmund, July 2007
Gernot A. Fink


Contents

1 Introduction
  1.1 Thematic Context
  1.2 Functional Principles of Markov Models
  1.3 Goal and Structure of the Book

2 Application Areas
  2.1 Speech
  2.2 Writing
  2.3 Biological Sequences
  2.4 Outlook

Part I: Theory

3 Foundations of Mathematical Statistics
  3.1 Random Experiment, Event, and Probability
  3.2 Random Variables and Probability Distributions
  3.3 Parameters of Probability Distributions
  3.4 Normal Distributions and Mixture Models
  3.5 Stochastic Processes and Markov Chains
  3.6 Principles of Parameter Estimation
    3.6.1 Maximum Likelihood Estimation
    3.6.2 Maximum a posteriori Estimation
  3.7 Bibliographical Remarks

4 Vector Quantization
  4.1 Definition
  4.2 Optimality
    Nearest-Neighbor Condition
    Centroid Condition
  4.3 Algorithms for Vector Quantizer Design
    Lloyd's Algorithm
    LBG Algorithm
    k-Means Algorithm
  4.4 Estimation of Mixture Density Models
    EM Algorithm
  4.5 Bibliographical Remarks

5 Hidden Markov Models
  5.1 Definition
  5.2 Modeling Emissions
  5.3 Use Cases
  5.4 Notation
  5.5 Evaluation
    5.5.1 The Production Probability
      Forward Algorithm
    5.5.2 The "Optimal" Production Probability
  5.6 Decoding
    Viterbi Algorithm
  5.7 Parameter Estimation
    5.7.1 Foundations
      Forward-Backward Algorithm
    5.7.2 Training Methods
      Baum-Welch Algorithm
      Viterbi Training
      Segmental k-Means
    5.7.3 Multiple Observation Sequences
  5.8 Model Variants
    5.8.1 Alternative Algorithms
    5.8.2 Alternative Model Architectures
  5.9 Bibliographical Remarks

6 n-Gram Models
  6.1 Definition
  6.2 Use Cases
  6.3 Notation
  6.4 Evaluation
  6.5 Parameter Estimation
    6.5.1 Redistribution of Probability Mass
      Discounting
    6.5.2 Incorporation of More General Distributions
      Interpolation
      Backing Off
    6.5.3 Optimization of Generalized Distributions
  6.6 Model Variants
    6.6.1 Category-Based Models
    6.6.2 Longer Temporal Dependencies
  6.7 Bibliographical Remarks

Part II: Practice

7 Computations with Probabilities
  7.1 Logarithmic Probability Representation
  7.2 Lower Bounds for Probabilities
  7.3 Codebook Evaluation for Semi-Continuous HMMs
  7.4 Probability Ratios

8 Configuration of Hidden Markov Models
  8.1 Model Topologies
  8.2 Modularization
    8.2.1 Context-Independent Sub-Word Units
    8.2.2 Context-Dependent Sub-Word Units
  8.3 Compound Models
  8.4 Profile HMMs
  8.5 Modeling Emissions

9 Robust Parameter Estimation
  9.1 Feature Optimization
    9.1.1 Decorrelation
      Principal Component Analysis I
      Whitening
    9.1.2 Dimensionality Reduction
      Principal Component Analysis II
      Linear Discriminant Analysis
  9.2 Tying
    9.2.1 Model Subunits
    9.2.2 State Tying
    9.2.3 Tying in Mixture Models
  9.3 Initialization of Parameters

10 Efficient Model Evaluation
  10.1 Efficient Evaluation of Mixture Densities
  10.2 Beam Search
  10.3 Efficient Parameter Estimation
    10.3.1 Forward-Backward Pruning
    10.3.2 Segmental Baum-Welch Algorithm
    10.3.3 Training of Model Hierarchies
  10.4 Tree-like Model Organization
    10.4.1 HMM Prefix Trees
    10.4.2 Tree-like Representation for n-Gram Models

11 Model Adaptation
  11.1 Basic Principles
  11.2 Adaptation of Hidden Markov Models
    Maximum-Likelihood Linear Regression
  11.3 Adaptation of n-Gram Models
    11.3.1 Cache Models
    11.3.2 Dialog-Step Dependent Models
    11.3.3 Topic-Based Language Models

12 Integrated Search Methods
  12.1 HMM Networks
  12.2 Multi-Pass Search
  12.3 Search Space Copies
    12.3.1 Context-Based Search Space Copies
    12.3.2 Time-Based Search Space Copies
    12.3.3 Language-Model Look-Ahead
  12.4 Time-Synchronous Parallel Model Decoding
    12.4.1 Generation of Segment Hypotheses
    12.4.2 Language Model-Based Search

Part III: Systems

13 Speech Recognition
  13.1 Recognition System of RWTH Aachen University
  13.2 BBN Speech Recognizer BYBLOS
  13.3 ESMERALDA

14 Character and Handwriting Recognition
  14.1 OCR System by BBN
  14.2 Duisburg Online Handwriting Recognition System
  14.3 ESMERALDA Offline Recognition System

15 Analysis of Biological Sequences
  15.1 HMMER
  15.2 SAM
  15.3 ESMERALDA

References

Index


1 Introduction

The invention of the first calculating machines and the development of the first universal computers were driven by the idea of liberating people from certain everyday tasks. At that time, one thought only of help with computations and by no means of helping hands in private homes. The computing machines being developed were thus meant to take over tasks that could, of course, also be carried out by humans, but which an automatic system could perform with substantially more perseverance and thus more reliably and, consequently, also more cheaply.

The rapid progress in the development of computer technology soon allowed researchers to dream of far more ambitious goals. In the endeavor to create so-called "artificial intelligence" (AI), one tried to outperform the capabilities of humans in certain areas. In the early days of AI, primarily the solving of mathematical or otherwise formally described problems by symbolic methods was regarded as intelligence.
Therefore, for a long time the prototypical area of research was the game of chess. The victory of the chess computer Deep Blue over world champion Kasparov in 1997 was arguably an important public-relations success for IBM. In the end, however, it proved only that playing chess is probably not such a typical achievement of intelligence, as in this discipline even the best human expert can be defeated by rather brute computing power. In the field of language understanding, which is central to human capabilities, however, all symbolic and rule-based methods that originated from the roots of AI research could achieve only moderate success.

Meanwhile, a radical paradigm shift has been completed. Typical human intelligence is now no longer considered to be manifest on the symbolic level, but rather in the capability to process various kinds of sensory input data. Among these are communication by spoken language, the interpretation of visual input, and the interaction with the physical environment by motion, touch, and grasping. Both in automatic image and speech processing and in robotics, first solutions were for many years developed from an engineering background. Since the successful use of statistical methods has shown that automatically trained systems by far outperform their "hard-wired" rule-based counterparts with respect to the flexibility and capabilities realized, the concept of learning receives special attention in these areas of research. In this respect, the human example still is