Speech Recognition in Adverse Environments: a Probabilistic Approach

by

Trausti Thor Kristjansson

A thesis presented to the University of Waterloo in fulfilment of the thesis requirement for the degree of Doctor of Philosophy in Computer Science

Waterloo, Ontario, Canada, 2002

© Trausti Thor Kristjansson 2002

I hereby declare that I am the sole author of this thesis.

I authorize the University of Waterloo to lend this thesis to other institutions or individuals for the purpose of scholarly research.

I further authorize the University of Waterloo to reproduce this thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.

The University of Waterloo requires the signatures of all persons using or photocopying this thesis. Please sign below, and give address and date.

Abstract

In this thesis I advocate a probabilistic view of robust speech recognition. I discuss the classification of distorted features using an optimal classifier, and I show how the generation of noisy speech can be represented as a generative graphical probability model. By doing so, my aim is to build a conceptual framework that provides a unified understanding of robust speech recognition, and to some extent bridges the gap between a purely signal processing viewpoint and the pattern classification or decoding viewpoint.

The most tangible contribution of this thesis is the introduction of the Algonquin method for robust speech recognition. It exemplifies the probabilistic method and encompasses a number of novel ideas. For example, it uses a probability distribution to describe the relationship between clean speech, noise, channel and the resultant noisy speech. It employs a variational approach to find an approximation to the joint posterior distribution which can be used for the purpose of restoring the distorted observations. It also allows us to estimate the parameters of the environment using a Generalized EM method.

Another important contribution of this thesis is a new paradigm for robust speech recognition, which we call uncertainty decoding.
This new paradigm follows naturally from the standard way of performing inference in the graphical probability model that describes noisy speech generation.

Acknowledgements

First and foremost I would like to thank my advisor Brendan Frey. One of the most enjoyable aspects of the last four years has been to discuss research ideas with him. His insight into complex theoretical and technical issues never ceases to impress me.

I would like to thank Li Deng and Alex Acero who made my visits at Microsoft Research exciting and productive. I benefited greatly from their wisdom and guidance over the last two years.

I am also indebted to my thesis committee, Pedro Moreno, Dale Schuurmans, Richard Mann and Paul Fieguth, who provided me with valuable suggestions during the last stages of my thesis work.

My stay in Waterloo over the last three years was all the more fun due to my friends who always had time for an interesting discussion over a cup of coffee. I am especially lucky to count Chris Pal and Sarah Dorner among my friends.

I thank my family for their love and support through the years: my mother Jona, sister Berglind and brother Arnar for giving me a home in Iceland to return to, and my father Kristjan for action-filled trips to Florida.

This thesis is dedicated to Susan, for her love, support and patience.

Contents

1 Introduction and Overview
  1.1 Thesis Overview
2 Speech Recognition in Adverse Environments
  2.1 Introduction to Automatic Speech Recognition
  2.2 Effects of the Environment
3 Classifying Corrupted Observations
  3.1 Classification of Corrupted Observations
  3.2 Classifying Corrupted Observations
    3.2.1 Correcting for Bias
    3.2.2 Maximum Likelihood Point Estimates: Using p(y|x)
    3.2.3 Maximum A Posteriori Point Estimates: Using p(y|x) and p(x)
    3.2.4 Classification Based on p(y|s)
  3.3 The Probability of Error
    3.3.1 A Numerical Example
  3.4 Discussion
4 Graphical Probability Models for Robust ASR
  4.1 A Generative Graphical Model for Noisy Speech
  4.2 Approaches to Noise Robust Speech Recognition
    4.2.1 Feature Cleaning
    4.2.2 Model Adaptation
    4.2.3 Uncertainty Decoding
5 MFC Transform and the Interaction Likelihood
  5.1 The Mel Frequency Cepstrum Transform
  5.2 The MFC Transform of the Interaction Equation
  5.3 The Interaction Likelihood
    5.3.1 Fixed Variance Interaction Likelihood
    5.3.2 Interaction Likelihood with Magnitude Dependent Variance
    5.3.3 Empirical Plot of Interaction Likelihood
6 Inference in Non-Gaussian Networks
  6.1 The Component Models
    6.1.1 The Speech Model
    6.1.2 The Noise Model
    6.1.3 The Channel Model
  6.2 Inference
7 Performance Evaluation and the Prior Art
  7.1 Measuring Performance
    7.1.1 The Aurora Evaluation Framework
    7.1.2 The Aurora-HTK Recognizer
  7.2 The Prior Art
    7.2.1 Feature Cleaning
    7.2.2 Model Adaptation
  7.3 Discussion
8 The Algonquin Framework
  8.1 The Algonquin Framework
    8.1.1 Linearization of the Interaction Likelihood
    8.1.2 The Posterior q_{y_obs}(z)
    8.1.3 The Parameters of q_{y_obs}(z)
    8.1.4 Updating the Linearization Point
    8.1.5 Convergence Properties of the Algorithm
    8.1.6 Variational Inference
    8.1.7 The Negative Relative Entropy F
  8.2 MMSE Algonquin Results on Aurora Database
  8.3 Effect of Speech Model Size
  8.4 Effect of Noise Model Size
  8.5 Using Factorized Versions of q
  8.6 Discussion
9 Learning Environmental Parameters
  9.1 Joint Learning of Noise and Channel Distortion
    9.1.1 A Generalized EM Method for Parameter Adaptation
    9.1.2 Learning p(n)
    9.1.3 Learning p(h)
  9.2 Convergence Properties
  9.3 Results
    9.3.1 Set A: Learning p(n)
    9.3.2 Set C: Joint Learning of p(n) and p(h)
  9.4 Discussion
10 Taking Uncertainty into Account
  10.1 Distributions as Observations
  10.2 The Effect of Uncertainty
  10.3 Uncertainty Decoding
  10.4 Estimate of p(s|y)/p(s)
  10.5 Experiments
  10.6 Results
  10.7 Discussion
11 Conclusion
  11.1 Summary and Contributions
  11.2 Future Extensions
A Notational Conventions
B Useful Identities
  B.1 Gaussian Quadratic Integral Forms
  B.2 Multiplication of Two Gaussians
  B.3 Change of Variables
  B.4 Multiplication of Linear Likelihood and Gaussian
  B.5 Matrix Calculus Results
C Derivations
  C.1 Log-Spectrum Domain Algonquin
    C.1.1 First Term: Equation (A-15)
    C.1.2 Second Term: Equation (A-16)
    C.1.3 Third Term: Equation (A-17)
    C.1.4 Fourth Term: Equation (A-18)
    C.1.5 Fifth Term: Equation (A-19)
    C.1.6 The Negative Relative Entropy F
    C.1.7 Derivation of η
    C.1.8 Derivation of Φ
    C.1.9 Derivation of ρ
D Results
  D.1 Comparison of Log-Spectrum and Cepstrum Features