
Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information

Dissertation approved by the Department of Informatics (Fakultät für Informatik) of the Karlsruhe Institute of Technology (KIT) for the academic degree of Doctor of Engineering (Doktor der Ingenieurwissenschaften)

by Ngoc Thang Vu
from Hanoi, Vietnam

Date of the oral examination: 23 January 2014
First reviewer: Prof. Dr.-Ing. T. Schultz
Second reviewer: Prof. E. Barnard

Acknowledgments

I would like to thank my supervisor, Prof. Tanja Schultz. She always believed in my research and supported me with many useful discussions. Her great personality and excellent research skills had a very strong effect on my scientific career. Moreover, all the travels, which are among the most beautiful experiences in my life, would not have been possible without her support.

Special thanks to my second supervisor, Prof. Etienne Barnard, who always supported my research. I am also grateful that he read my thesis and provided many useful suggestions and comments. It was very kind of him to take the long trip from South Africa to Karlsruhe to participate in the dissertation committee.

I started my PhD program at CSL in September 2009, but my first experience with speech recognition came from my master's thesis on Vietnamese ASR. I was excited to work on speech recognition for Vietnamese, my mother tongue. To this day, I am very grateful to all my relatives and my friends in Hanoi and Ho Chi Minh City, Vietnam, as well as in Karlsruhe, Germany, for supporting me in collecting the Vietnamese GlobalPhone data. This database and the first exciting work on automatic speech recognition motivated me to start my PhD in multilingual speech recognition.

Moreover, thanks to Roger Hsiao, I learned to build my first ASR system for French with a large amount of training data. He shared with me many experiences related to discriminative training for acoustic models.

I was in the USA for the first time in 2010, when I had the chance to visit InterAct at Carnegie Mellon University and worked with Florian Metze on Bottle-Neck features. Thanks to him, I learned more about the Janus speech recognition toolkit and Bottle-Neck features.

I was extremely fortunate to participate in the KALDI workshops in 2011 and 2013. There I got to know many new friends who are excellent researchers. The exchange with David Imseng, Stefan Kombrink, Korbinian Riedhammer, Karel Versely, Arnab Ghoshal, Martin Karafiat, Petr Motlicek, Yanmin Qian, and Sanjeev Khundapur helped me a lot. Thanks to Stefan Kombrink, I gathered my first experience with recurrent neural network language modeling. Thanks to David Imseng, I gained a better understanding of Kullback-Leibler HMM decoding. It was a great experience working with him on our first joint paper for ICASSP 2014. Furthermore, it was a great pleasure to work with Daniel Povey, who had a strong effect on my research with his excellent research skills.

In 2013, I was awarded the "Kontakte knüpfen" scholarship, which allowed me to travel to different research groups to present my thesis and obtain feedback. Again, I had a chance to work with Daniel Povey on multilingual Deep Neural Network acoustic modeling. It was great to learn from him about deep neural networks. As a part of this tour, I also visited Nuance, ICSI, and SRI International. Thanks to Sanjeev Khundapur, Paul Vozila, Korbinian Riedhammer, Andreas Stolcke, Nelson Morgan, Yik-Cheung Tam, and Dimitra Vergyri, I obtained much useful feedback on my dissertation.

Furthermore, I would like to thank all my friends and my colleagues at CSL for a great time. Their support is magnificent. Thanks to Tim Schlippe, Michael Wand, Matthias Janke, Dominic Telaar, Dominic Heger, Christoph Amma, Christian Herff, Felix Putze, Heike Adel, Udhyakumar Nallasamy, Dirk Gehrig, and Daniel Reich for many great travel experiences and lovely activities after work. Special thanks to Tim Schlippe and Dominic Telaar for their support during difficult moments. Thanks to Franziska Kraus, Jochen Weiner, Zlatka Mihaylova, Edy Guevara Komgang Djomgang, Wojtek Breiter, Yuanfan Wang, Marten Klose, and Michael Ikkert for their encouragement. Moreover, thanks to Helga Scherer for her support.
Special thanks to Heike Adel for her support and useful discussions. She was always there for me when I had a difficult time. It was also great to work together with her on language modeling for Code-Switching. I am very grateful that she read and improved all the pages of my thesis.

Finally, special thanks to my parents and my sister for their constant support. It took me more than ten years in Germany to obtain the diploma and the PhD in computer science. It was a very long journey, and they have always been there for me.

Summary

This thesis explores methods to rapidly bootstrap automatic speech recognition (ASR) systems for languages that lack resources for speech and language processing, called low-resource languages. We focus on finding approaches that allow using data from multiple languages to improve ASR systems for those languages on different levels, such as feature extraction, acoustic modeling, and language modeling. From an application perspective, this thesis also includes research work on non-native and Code-Switching speech, which have become more common in the modern world.

The main contributions of this thesis are as follows:

Building an ASR system without transcribed audio data: In this thesis, we developed a multilingual unsupervised training framework which allows building ASR systems without transcribed audio data. Several existing ASR systems from different languages were used in combination with cross-language transfer techniques and unsupervised training to iteratively transcribe the audio data of the target language and, therefore, bootstrap ASR systems. The key contribution is the proposal of a word-based confidence score called "Multilingual A-stabil", which works well not only with well-trained acoustic models but also with a poorly estimated acoustic model, such as one which is borrowed from other languages in order to bootstrap the acoustic model for an unseen language. All the experimental results showed that it is possible to build ASR systems for new languages without any transcribed data, even if the source and the target languages are not related.
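The summary does not define "Multilingual A-stabil", so the following is only a rough sketch of the general idea behind an agreement-based, word-level confidence score. It assumes that several recognizers (for example, ones borrowed from different source languages) each produce a hypothesis for the same utterance, and it scores every word of a reference hypothesis by the fraction of alternatives that contain the same word at the same position. The function names, the position-based matching rule, and the threshold value are illustrative assumptions, not the formulation used in the thesis.

```python
from typing import List, Tuple


def agreement_confidence(
    reference: List[str], alternatives: List[List[str]]
) -> List[Tuple[str, float]]:
    """Score each reference word by the share of alternatives that agree with it.

    Assumption: hypotheses are compared position by position, which is a
    simplification; a real system would align them by time or edit distance.
    """
    if not alternatives:
        raise ValueError("need at least one alternative hypothesis")
    scored = []
    for position, word in enumerate(reference):
        # Count alternatives that have the same word at the same position.
        votes = sum(
            1
            for hyp in alternatives
            if position < len(hyp) and hyp[position] == word
        )
        scored.append((word, votes / len(alternatives)))
    return scored


def select_confident(
    scored: List[Tuple[str, float]], threshold: float = 0.5
) -> List[str]:
    """Keep only words whose agreement score reaches the (assumed) threshold."""
    return [word for word, confidence in scored if confidence >= threshold]


if __name__ == "__main__":
    reference = ["toi", "di", "hoc"]
    alternatives = [["toi", "di", "lam"], ["toi", "ve", "hoc"], ["toi", "di"]]
    scored = agreement_confidence(reference, alternatives)
    print(scored)
    print(select_confident(scored))
```

In an unsupervised training loop of the kind described above, only the words selected this way would be kept as automatic transcriptions for the next training iteration.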
Multilingual Bottle-Neck features: We explored multilingual Bottle-Neck (BN) features and their application to rapid language adaptation to new languages. Our results revealed that using a multilingual multilayer perceptron (MLP) to initialize the MLP training for new languages improved the MLP performance and, therefore, the ASR performance. Finally, visualization of the features using t-SNE leads to a better understanding of the multilingual BN features.

Improving ASR performance on non-native speech using multilingual and crosslingual information: This part presents our exploration of using multilingual and crosslingual information to improve the ASR performance on non-native speech. We showed that a multilingual ASR system consistently outperforms a monolingual ASR system on non-native speech. Finally, we proposed a method called cross-lingual accent adaptation to improve the ASR performance on non-native speech without any adaptation data. With this approach, we achieved substantial improvements over the baseline system.

Multilingual deep neural network based acoustic modeling for rapid language adaptation: This thesis comprises an investigation of multilingual deep neural network (DNN) based acoustic modeling and its application to new languages. We investigated the effect of phone merging on multilingual DNN in the context of rapid language adaptation and the combination of multilingual DNNs with Kullback-Leibler divergence based acoustic modeling (KL-HMM). Our studies revealed that KL-HMM based decoding consistently outperformed conventional hybrid decoding, especially in low-resource scenarios. Furthermore, we found that multilingual DNN training equally benefits from simple phone set concatenation and a manually derived universal phone set based on IPA.
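As background for the KL-HMM combination mentioned above: in KL-HMM acoustic modeling, as it is commonly described, each HMM state holds a categorical distribution over phone classes, and the local score for a frame is a Kullback-Leibler divergence between that distribution and the posterior vector produced by the neural network. The sketch below illustrates only this local score. The direction of the divergence, the epsilon floor, and all names are assumptions made for illustration; the thesis may use a different variant.

```python
import math
from typing import Dict, Sequence


def kl_divergence(
    state_dist: Sequence[float], posterior: Sequence[float], eps: float = 1e-10
) -> float:
    """KL(state_dist || posterior) for two categorical distributions.

    Assumption: posterior entries are floored at eps to avoid division by zero.
    """
    if len(state_dist) != len(posterior):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for y, z in zip(state_dist, posterior):
        if y > 0.0:  # terms with y == 0 contribute nothing
            total += y * math.log(y / max(z, eps))
    return total


def best_state(posterior: Sequence[float], states: Dict[str, Sequence[float]]) -> str:
    """Return the state whose distribution is closest to the frame posterior."""
    return min(states, key=lambda name: kl_divergence(states[name], posterior))


if __name__ == "__main__":
    # Two hypothetical states over a two-class phone posterior space.
    states = {"state_a": [0.9, 0.1], "state_b": [0.1, 0.9]}
    frame_posterior = [0.8, 0.2]
    print(best_state(frame_posterior, states))
```

A full decoder would accumulate these local scores along state sequences with the Viterbi algorithm; the sketch stops at the per-frame comparison.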
Multilingual language modeling for Code-Switching speech: We investigated the integration of high-level features, such as part-of-speech tags and language identifiers, into language models for Code-Switching speech. Our results showed that using these features in state-of-the-art language modeling techniques, such as recurrent neural network and factored language models, improved the perplexity and mixed error rate on Code-Switching speech. Moreover, the interpolated language model between these two LMs gave the best performance on the SEAME database. Finally, we showed that Code-Switching is speaker dependent and, therefore, Code-Switching-attitude-dependent language modeling further improved the perplexity and the mixed error rate.

We believe that our findings will have an increasing impact over time, not only for research but also for industry. The results can be used to save costs and development time when building a speech recognizer for a new language. In addition, the contribution of this thesis on non-native and Code-Switching speech will become more important due to rapidly growing globalization.

Zusammenfassung (translated from German)

In this thesis, we investigate various methods for developing automatic speech recognition (ASR) systems for new languages with few resources. In particular, we concentrate on approaches that use data from several languages to improve different components of ASR for such languages, such as feature extraction, acoustic modeling, and language modeling. With regard to applications, this dissertation also includes research on accented and Code-Switching speech, which are becoming increasingly common in the modern world.

The most important contributions of this thesis are the following:

Building an ASR system without transcribed speech data: In this thesis, a multilingual, unsupervised training framework is developed that makes it possible to build an ASR system without transcribed data. The idea is to use speech recognizers of other languages in combination with unsupervised training. This minimizes the time and cost of transcribing the speech data. A key contribution is the development of a word-based confidence measure called "multilingual A-stabil", which works not only with robust acoustic models but also with a weak acoustic model. All experimental results show that we can build an ASR system for new languages without transcribed data, even if the source and target languages are not related.

Multilingual Bottle-Neck speech features: Integrating neural networks into the preprocessing of the speech recognizer in the form of Bottle-Neck features is the state of current research. In this thesis, multilingual neural networks and their applicability to new languages are investigated. We present an innovative approach that uses already trained multilingual neural networks for initialization. A visualization of the features using t-SNE makes it possible to develop a better understanding of multilingual Bottle-Neck speech features.

Improving ASR performance on accented speech with the help of multilingual and crosslingual information: This thesis explores the use of multilingual and crosslingual information to improve ASR performance on accented speech. We show that a multilingual ASR system works better on accented speech than a monolingual ASR system. In addition, we developed a new method, crosslingual accent adaptation, which improves ASR performance on accented speech without adaptation data. With this approach, we were able to achieve significant improvements over the reference system.

Acoustic modeling based on multilingual Deep Neural Networks: This thesis comprises the investigation of multilingual Deep Neural Networks (DNN) for acoustic modeling and their application to new languages. We investigate how merging the phone set when training a DNN, and combining multilingual DNNs with the Kullback-Leibler divergence Hidden Markov Model (KL-HMM) during decoding, affect ASR performance on new languages. Our investigations show that KL-HMM based decoding improves ASR performance, especially when only limited training data is available for the new language. Furthermore, we found that merging the phone set on the basis of IPA has no effect on multilingual DNN training.

Multilingual language modeling for Code-Switching speech: We investigate the integration of linguistic features, such as parts of speech and language identifiers, into language models for Code-Switching. Our results show that using these features in different language modeling techniques, such as recurrent neural networks or factored language models, improves the perplexity of the language model as well as the error rate of the speech recognizer on Code-Switching. In addition, the combination of these two techniques delivers the best performance on our test set. Finally, we show that Code-Switching behavior is speaker dependent. Therefore, Code-Switching-behavior-dependent language modeling delivers further improvements on the Code-Switching data corpus.

The importance of this dissertation will grow in the future, not only in research but also in practice. On the one hand, the results can be used to save costs and development time when building a speech recognizer for a new language. On the other hand, the work on accented speech and Code-Switching is gaining importance due to rapidly increasing globalization.

Contents

1 Introduction
  1.1 Aspects of multilingual ASR
  1.2 History of multilingual ASR
  1.3 Current developments
  1.4 Main contributions
    1.4.1 Objectives
    1.4.2 Contribution
  1.5 Structure of the thesis

2 Background
  2.1 Languages
    2.1.1 Languages of the world
    2.1.2 Linguistic description and classification
  2.2 Automatic speech recognition
    2.2.1 Signal preprocessing
    2.2.2 Acoustic modeling
    2.2.3 Language modeling
    2.2.4 Combining acoustic and language models
    2.2.5 N-best lists and word lattices
    2.2.6 Unsupervised training of acoustic models
    2.2.7 Acoustic model adaptation
    2.2.8 Evaluation criteria

3 Data, Tools and Baseline (ASR) Systems for Multiple Languages
  3.1 Data corpora
    3.1.1 GlobalPhone database
    3.1.2 Non-native speech database
    3.1.3 SEAME corpus
  3.2 Speech recognition for multiple languages
    3.2.1 Acoustic modeling
    3.2.2 Language modeling
    3.2.3 Language specific system optimization

4 Cross-language Bootstrapping Based on Completely Unsupervised Training
  4.1 Introduction
  4.2 Related work
    4.2.1 Unsupervised and lightly unsupervised training
    4.2.2 Confidence score
    4.2.3 Cross-language bootstrapping
  4.3 Cross-language modeling based on phone mapping
    4.3.1 General idea and implementation
    4.3.2 Experiments and results
  4.4 Multilingual A-Stabil - A Multilingual Confidence Score
    4.4.1 Investigation of confidence scores
    4.4.2 Multilingual A-Stabil
    4.4.3 Threshold selection
  4.5 Multilingual unsupervised training framework
  4.6 Experiments and results
    4.6.1 Experimental setup
    4.6.2 Closely related languages vs resource-rich languages
    4.6.3 Under-resourced languages - a study for Vietnamese


Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and ... PDF

203 Pages·2014·3.27 MB·English

by Ngoc Thang Vu
