Université du Québec
Institut national de la recherche scientifique (INRS)
Énergie, Matériaux et Télécommunications (EMT)

Language Modeling for Speech Recognition Incorporating Probabilistic Topic Models

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

By
Md. Akmal Haidar

Evaluation Jury

External Examiner: John Hershey, Mitsubishi Electric Research Laboratories
External Examiner: Lidia Mangu, IBM Corporation
Internal Examiner: Tiago Falk, INRS-EMT
Research Director: Douglas O'Shaughnessy, INRS-EMT

© Copyright by Md. Akmal Haidar, 2014

I would like to dedicate this thesis to my loving parents...

Declaration

I hereby declare that except where specific reference is made to the work of others, the contents of this dissertation are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other, university. This dissertation is the result of my own work and includes nothing that is the outcome of work done in collaboration, except where specifically indicated in the text.

Md. Akmal Haidar
2014

Acknowledgements

I would like to express my special appreciation and thanks to my supervisor, Professor Douglas O'Shaughnessy, who has been a tremendous mentor for me. I would like to thank you for encouraging my research and for allowing me to grow as a research scientist. Your wide knowledge and logical way of thinking have been of great value to me. Your understanding, encouragement, and personal guidance have provided a good basis for this thesis. Your advice on both research and my career has been priceless.

I would also like to thank my committee members, Dr. Lidia Mangu, Dr. John Hershey, and Professor Tiago Falk, for serving as my committee members despite the hardship involved, and for their brilliant comments and suggestions. Moreover, I would like to thank all the staff of INRS-EMT, University of Quebec, Place Bonaventure, Montreal, Canada, especially Madame Hélène Sabourin, Madame Nathalie Aguiar, and Mr. Sylvain Fauvel, for their support over the past years. Thanks also go to Dr. Mohammed Senoussaoui for helping me write the French summary of this thesis.

A special thanks to my parents. Words cannot express how grateful I am to you for all of the sacrifices that you have made on my behalf. Your prayers for me are what have sustained me thus far. I would also like to thank my sisters, brothers, and all of my friends who supported me in striving toward my goal.

Last but not least, thanks to the Almighty, the most merciful and the most compassionate, for providing me the opportunity to step into the excellent research world of speech and language technology.

Abstract

The use of language models in automatic speech recognition helps to find the best next word and to increase recognition accuracy. Statistical language models (LMs) are trained on a large collection of text to automatically determine the model's parameters. LMs encode linguistic knowledge in a way that is useful for processing human language. Generally, an LM exploits only the immediate past information. Such models capture short-length dependencies between words very well. However, in any language used for communication, words carry both semantic and syntactic importance. Most speech recognition systems are designed for a specific task and use language models trained on a large amount of text appropriate for that task. A task-specific language model will not do well on a different domain or topic, and a perfect language model for speech recognition on general language is still far away. Language models trained on a diverse range of text styles can do well overall, but are not perfectly suited to any particular domain.
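As a minimal illustration of the short-span dependency that a standard n-gram LM captures, the following sketch estimates maximum-likelihood bigram probabilities from a toy corpus. This is an illustrative sketch only, not code from the thesis; the corpus and function names are invented for the example.

    from collections import Counter

    def train_bigram_lm(sentences):
        """Estimate maximum-likelihood bigram probabilities P(w | h)."""
        unigrams, bigrams = Counter(), Counter()
        for words in sentences:
            tokens = ["<s>"] + words + ["</s>"]
            unigrams.update(tokens[:-1])
            bigrams.update(zip(tokens[:-1], tokens[1:]))
        # P(w | h) = count(h, w) / count(h): only the immediately
        # preceding word (the "immediate past") conditions the prediction.
        return {(h, w): c / unigrams[h] for (h, w), c in bigrams.items()}

    corpus = [["the", "market", "rose"], ["the", "market", "fell"]]
    lm = train_bigram_lm(corpus)
    print(lm[("the", "market")])  # 1.0 -- "market" always follows "the" here

Because the conditioning context is only the previous word (or the previous n-1 words in general), such a model cannot exploit the longer-range semantic regularities that the topic-model-based approaches of this thesis target.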
In this research, we introduce new language modeling approaches for automatic speech recognition (ASR) systems that incorporate probabilistic topic models.

In the first part of the thesis, we propose three approaches for LM adaptation that cluster the background training data into different topics using latent Dirichlet allocation (LDA). In the first approach, a hard-clustering method is applied to the LDA training to form different topics, and we propose an n-gram weighting technique to form an adapted model by combining the topic models. The models are then further modified with latent semantic marginals (LSM) using a minimum discriminant information (MDI) technique. In the second approach, we introduce a clustering technique in which the background n-grams are distributed over different topics using a fraction of the global count of each n-gram: the probabilities of the n-grams for the different topics are multiplied by the global counts of the n-grams, and the products are used as the counts for the respective topics. We also introduce a weighting technique that outperforms the n-gram weighting technique. In the third approach, we propose another clustering technique in which the topic probabilities of the training documents are multiplied by the document-based n-gram counts and the products are summed over all training documents; the background n-grams are thereby assigned to the different topics.

In the second part of the thesis, we propose five topic modeling algorithms that are trained using the expectation-maximization (EM) algorithm. A context-based probabilistic latent semantic analysis (CPLSA) model is proposed to overcome a limitation of the recently proposed unsmoothed bigram PLSA (UBPLSA) model: because the CPLSA model can compute all the bigram probabilities in the training phase, it can compute the correct topic probabilities of an unseen test document and thereby yields the proper bigram model for that document. The CPLSA model is extended to a document-based CPLSA (DCPLSA) model in which document-based word probabilities for topics are trained; the DCPLSA model is motivated by the fact that the same words can describe different topics in different documents. An interpolated latent Dirichlet language model (ILDLM) is proposed to incorporate long-range semantic information by interpolating distance-based n-grams into the recently proposed LDLM. Similarly to the LDLM and ILDLM models, we propose enhanced PLSA (EPLSA) and interpolated EPLSA (IEPLSA) models in the PLSA framework.

In the final part of the thesis, we propose two new Dirichlet class language models, trained using the variational Bayesian EM (VB-EM) algorithm, that incorporate long-range information into the recently proposed Dirichlet class language model (DCLM), whose latent variable represents the class information of an n-gram event rather than the topic, as in LDA. We introduce an interpolated DCLM (IDCLM) in which the class information is exploited from the (n-1) previous history words of the n-grams through a Dirichlet distribution using interpolated distanced n-grams. A document-based DCLM (DDCLM) is also proposed, in which the DCLM is trained for each document using document-based n-gram events.

In all the above approaches, the adapted models are interpolated with the background model to capture the local lexical regularities. We perform experiments on the '87-89 Wall Street Journal (WSJ) corpus using a multi-pass continuous speech recognition (CSR) system: in the first pass, we use the background n-gram language model for lattice generation, and we then apply the LM adaptation approaches for lattice rescoring in the second pass.
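To make the recurring adaptation step concrete, the sketch below forms an adapted model as a topic-weighted mixture of topic n-gram models and then linearly interpolates it with the background model, as the abstract describes. This is a schematic sketch under assumed inputs, not the thesis implementation: the probability tables, the topic weights gamma, and the interpolation weight lam are placeholders.

    def adapted_prob(w, h, topic_lms, gamma):
        """Topic mixture: P_A(w | h) = sum_k gamma_k * P_k(w | h)."""
        return sum(g * lm.get((h, w), 0.0) for g, lm in zip(gamma, topic_lms))

    def interpolated_prob(w, h, topic_lms, gamma, background_lm, lam):
        """Final model: P(w | h) = lam * P_A(w | h) + (1 - lam) * P_B(w | h).

        The background LM supplies the local lexical regularities; the
        adapted topic mixture contributes the longer-range topical
        information recovered by the topic model.
        """
        p_adapt = adapted_prob(w, h, topic_lms, gamma)
        p_back = background_lm.get((h, w), 0.0)
        return lam * p_adapt + (1.0 - lam) * p_back

In practice, the topic weights gamma would be inferred from the first-pass hypothesis of the test data (e.g., by LDA inference), and lam would be tuned on held-out data; the resulting model is then used to rescore the first-pass lattices.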
Supervisor: Douglas O'Shaughnessy, Ph.D.
Title: Professor and Program Director