
ALI-DISSERTATION-2013 PDF

146 Pages·2013·1.35 MB·English


VOICE QUERY-BY-EXAMPLE FOR RESOURCE-LIMITED LANGUAGES USING AN ERGODIC HIDDEN MARKOV MODEL OF SPEECH

A Dissertation
Presented to
The Academic Faculty

By
Asif Ali

In Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy in Electrical and Computer Engineering

School of Electrical and Computer Engineering
Georgia Institute of Technology
December 2013

Copyright © 2013 by Asif Ali

Approved by:

Professor Mark A. Clements, Advisor
Professor, School of ECE
Georgia Institute of Technology

Professor John Copeland
Professor, School of ECE
Georgia Institute of Technology

Professor Chin-Hui Lee
Professor, School of ECE
Georgia Institute of Technology

Professor Alexander Lerch
Asst. Professor, School of Music
Georgia Institute of Technology

Professor David V. Anderson
Professor, School of ECE
Georgia Institute of Technology

Date Approved: 14 November 2013

To family and good friends

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my thesis advisor, Professor Mark A. Clements, from whom I have learned so much, about life and research. He has been a true inspiration, and I thank him for his guidance in helping me develop as a researcher and as an individual.

Next, I would like to thank the members of my thesis committee: Prof. Chin-Hui Lee, Prof. David V. Anderson, Prof. John Copeland, and Prof. Alexander Lerch, for their many insightful comments and suggestions. I would also like to acknowledge and thank Nexidia Inc. for their contributions and help in this research.

I would also like to thank the Fulbright Scholarship Commission, and its funding agencies, the U.S. Department of State's Bureau of Educational and Cultural Affairs, and Higher Education Commission, Pakistan, for supporting this research. I would also like to express my gratitude to the scholarship administrators at Institute of International Education and United States Education Foundation, Pakistan.

I would like to thank Jennifer Lunsford at Center for Signal and Information Processing, and Dr. Daniela Staiculescu at ECE graduate office. I wish to thank my fellow graduate students and researchers, Hrishikesh Rao, Ahmad Beirami, Mehrez Souden, Samir Rustamov, Jonathan Kim and many others, for their friendship, help, and camaraderie.

Most importantly, I wish to thank my family: my parents, my lovely wife, my brothers and sisters, for loving me, for believing in me, and for supporting me every step of the way.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
SUMMARY

CHAPTER 1 INTRODUCTION
  1.1 Origin and History of the Problem
    1.1.1 Speech Production
    1.1.2 Acoustic Properties of Phonemes
    1.1.3 Common Sources of Variation in Speech
    1.1.4 Manifold Model of Observation Distribution
  1.2 An Overview of Query-by-Example Spoken Term Detection Systems
    1.2.1 Unsupervised Approaches to Spoken Term Detection
    1.2.2 Supervised Statistical Modeling Techniques
    1.2.3 Unsupervised Density Modeling Schemes
  1.3 Development and Test Databases
    1.3.1 TIDIGITS
    1.3.2 The Wall Street Journal Corpus
    1.3.3 TIMIT Corpus
    1.3.4 The Fisher Corpus
    1.3.5 Hausa
    1.3.6 Johns Hopkins University CLSP Collection
    1.3.7 MediaEval 2013 Spoken Web Search Database

CHAPTER 2 AN ERGODIC HIDDEN MARKOV MODEL OF SPEECH
  2.1 An EHMM of speech
    2.1.1 Number of States in an EHMM
    2.1.2 Modeling the Observation Density
    2.1.3 Training
  2.2 Spectral Characteristics of States
    2.2.1 An EHMM of Hausa
    2.2.2 Classification of States based on Spectral Characteristics
  2.3 Capturing the Temporal Characteristics of Speech

CHAPTER 3 KEYWORD SPOTTING USING AN EHMM OF SPEECH
  3.1 Dynamic Time Warping
  3.2 The Viterbi Algorithm
  3.3 Comparison between Keyword Search Algorithms
    3.3.1 Experiments on WSJ Corpus
    3.3.2 Experiments on TIMIT Corpus

CHAPTER 4 KEYWORD SPOTTING IN MULTI-SPEAKER ENVIRONMENT
  4.1 Speaker Invariance Schemes
    4.1.1 Time Domain
    4.1.2 Frequency Domain
  4.2 Manifold Model of Observation Density
    4.2.1 Clustering Observations on a Manifold
  4.3 State Parameter Estimation for HMM
    4.3.1 Learning by Example
    4.3.2 Nearest-neighbor Clustering
    4.3.3 Clustering using an Identity-Preserving Operation
    4.3.4 Hierarchical Clustering

CHAPTER 5 DESIGN OF VOICE-QUERY-BY-EXAMPLE SYSTEM USING AN EHMM OF SPEECH
  5.1 A CISI-EHMM of Speech
    5.1.1 Estimating the Parameters of the CISI-EHMM
    5.1.2 Keyword Model and Keyword Search Algorithm
  5.2 Graphical Keyword Model and 3D Viterbi Algorithm
    5.2.1 A 3D Viterbi Algorithm
    5.2.2 Length-normalized Optimality Criteria
    5.2.3 Length Estimation
    5.2.4 Fast Viterbi algorithm using Token-passing

CHAPTER 6 EXPERIMENTS AND RESULTS
  6.1 Comparison of Different Clustering Schemes
    6.1.1 Test Setup
    6.1.2 Nearest-neighbor Clustering Scheme
    6.1.3 Clustering using Dynamic Frequency Warping
    6.1.4 Comparison of Clustering Schemes
  6.2 Experiments on JHU CLSP Word Collection
    6.2.1 Test Setup
    6.2.2 Evaluation Metric
    6.2.3 Issues Specific to HMM-based Systems
    6.2.4 Results
  6.3 Experiments on MediaEval 2013 Corpus
    6.3.1 Test Setup
    6.3.2 Evaluation Metrics
    6.3.3 Results
    6.3.4 Final Term Weighted Value for MediaEval SWS 2013

CHAPTER 7 CONCLUSIONS

References

LIST OF TABLES

Table 1 Average log-likelihood per frame for different EHMMs
Table 2 Probability of detection for different clustering schemes
Table 3 Probability of detection for different cluster sizes and clustering schemes
Table 4 Average precision of an EHMM-based system vs. systems evaluated at JHU Workshop
Table 5 Precision of EHMM after different levels of training
Table 6 MTWV for the development set as a function of number of states in the model
Table 7 MTWV for the development set as a function of number of clusters in the model
Table 8 Calculated TWV for a single trial using different score normalization schemes
Table 9 MTWVs for zero-resource systems submitted to MediaEval 2013
Table 10 MTWV of open systems submitted to MediaEval 2013
Table 11 MTWV for the development set in extended trials

LIST OF FIGURES

Figure 1 The spectrogram of the vowel /AO/
Figure 2 Vowel chart indicating the position of first two formants
Figure 3 A 3-state left-right HMM
Figure 4 A 4-state ergodic HMM
Figure 5 Mean values of states of the EHMM corresponding to a vowel (a) and a sibilant (b)
Figure 6 Comparison of nasal consonant /N/ in Hausa and North American English
Figure 7 An ergodic HMM of English language trained on the TIMIT corpus
Figure 8 Spectrum of two states with different average durations
Figure 9 True positive rate vs. errors per hour for WSJ trials
Figure 10 Posteriorgrams for two instances of the utterance “California”
Figure 11 Gender-specific variations in the spectrum of a vowel
Figure 12 Histogram of pitch values for different states of EHMM
Figure 13 Observation distribution in the form of manifold
Figure 14 Classification of observations in the form of a manifold
Figure 15 Isolating the classes using expert systems
Figure 16 Dynamic Frequency Warping
Figure 17 Similar states discovered by the DFW-based clustering algorithm
Figure 18 The composition of states of a CISI-EHMM of speech
Figure 19 Statistical model of keyword based on an EHMM of Speech
Figure 20 A three-dimensional lattice for graphical keyword model
Figure 21 The Viterbi window for a hypothetical match
Figure 22 The token-passing algorithm
Figure 23 Detection probability for queries in Group 3
Figure 24 Plot of true positive rate for keyword group 3
Figure 25 Lattice Structure for DTW
Figure 26 Histogram of length-normalized scores for two different trials
Figure 27 TWV for length-normalized scores
Figure 28 Maximum Term Weighted Value for Development and Evaluation Sets

Description:
... an utterance while safely ignoring those that do not alter its meaning. ... The speakers were largely recorded in a studio environment, ... of approximately 3 hours of speech from Hindi, Telugu, and Gujarati languages. ...