Microphone Array Processing for Robust Speech Recognition

Michael L. Seltzer

Thesis Committee:
Richard M. Stern, Chair
Tsuhan Chen
Gary W. Elko
B. V. K. Vijaya Kumar

Submitted to the Department of Electrical and Computer Engineering in partial fulfillment of the requirements for the degree of Doctor of Philosophy at

Carnegie Mellon University
Pittsburgh, PA 15213

July 2003

Copyright © 2003 Michael L. Seltzer

To my parents.

Abstract

Speech recognition performance degrades significantly in distant-talking environments, where the speech signals can be severely distorted by additive noise and reverberation. In such environments, the use of microphone arrays has been proposed as a means of improving the quality of captured speech signals. Currently, microphone-array-based speech recognition is performed in two independent stages: array processing and then recognition. Array processing algorithms designed for signal enhancement are applied in order to reduce the distortion in the speech waveform prior to feature extraction and recognition.

This approach assumes that improving the quality of the speech waveform will necessarily result in improved recognition performance. However, speech recognition systems are statistical pattern classifiers that process features derived from the speech waveform, not the waveform itself. An array processing algorithm can therefore only be expected to improve recognition if it maximizes or at least increases the likelihood of the correct hypothesis, relative to other competing hypotheses.

In this thesis a new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features which maximizes the likelihood of the correct hypothesis. In this approach, called Likelihood Maximizing Beamforming (LIMABEAM), information from the speech recognition system itself is used to optimize a filter-and-sum beamformer.
Using LIMABEAM, significant improvements in recognition accuracy over conventional array processing approaches are obtained in moderately reverberant environments over a wide range of signal-to-noise ratios. However, only limited improvements are obtained in environments with more severe reverberation.

To address this issue, a subband filtering approach to LIMABEAM is proposed, called Subband-Likelihood Maximizing Beamforming (S-LIMABEAM). S-LIMABEAM employs a new subband filter-and-sum architecture which explicitly considers how the features used for recognition are computed. This enables S-LIMABEAM to achieve dramatically improved performance over the original LIMABEAM algorithm in highly reverberant environments.

Because the algorithms in this thesis are data-driven, they do not require a priori knowledge of the room impulse response, nor any particular number of microphones or array geometry. To demonstrate this, LIMABEAM and S-LIMABEAM are evaluated using multiple array configurations and environments including an array-equipped personal digital assistant (PDA) and a meeting room with a few tabletop microphones. In all cases, the proposed algorithms significantly outperform conventional array processing approaches.

Acknowledgments

This thesis would not have been possible without the support and encouragement of many people. To the following people, I owe an enormous debt of gratitude.

My advisor, Professor Richard Stern, who welcomed me into his group when I arrived at CMU knowing absolutely nothing about speech recognition. Rich gave me the guidance I needed to become a capable researcher and the freedom to pursue my own ideas. He has been a role model of professionalism and integrity, as well as a good friend.

Dr. Gary Elko, whose expertise in microphone arrays and acoustics was invaluable. He was very generous with his time and always quick to respond to my questions.
Professor B. V. K. Vijaya Kumar and Professor Tsuhan Chen, who made many valuable suggestions that really improved the quality of this dissertation.

The members of the CMU Robust Speech Group, past and present, from whom I learned a great deal and with whom I shared many laughs. Rita Singh, Juan Huerta, and Sam-Joo Doh never grew tired of answering my endless stream of questions during my first two years here. Jon Nedel, Evandro Gouvêa, Xiang Li, and Pablo Hennings were always around for discussions and their camaraderie created a great work environment.

Bhiksha Raj, who has been a mentor, a collaborator, and a good friend. He taught me not to shy away from tough research problems and his positive attitude in the face of some difficult circumstances amazes and inspires me.

Tom Sullivan and Yasunari Obuchi, who recorded the microphone array data used in this thesis, and the ICSI speech group, who graciously gave me their meeting room data.

Kevin Dixon, Jay Wylie, Dan Gaugel, Stefanie Tomko, Mike Stout, George Lopez, and Rich Malak, whose friendship was a constant reminder that there is more to graduate school than research and work. They were a limitless source of support, fun, and laughs, and I feel extremely lucky to be able to call them my friends.

Julie, who has given me more love and encouragement during this process than anyone can possibly ask for. Her boundless support kept me going when things were difficult, and I can’t wait to start the next chapter of my life with her.

My sister Gabrielle, who has been my biggest cheerleader throughout my life and to whom I have grown closer over the past five years, despite the distance between us.

Finally, there is no way I would be where I am today without the immeasurable love, support, and encouragement of my parents. I cannot thank them enough for all the opportunities they have given me in my life and for always believing in me. They are an incredible inspiration to me.
Table of Contents

Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables

1 Introduction
    1.1 What this thesis is about
    1.2 Dissertation Outline

2 A Review of Automatic Speech Recognition
    2.1 Introduction
    2.2 HMM-based Automatic Speech Recognition
        2.2.1 Feature Extraction
        2.2.2 HMM-based Modeling of Distributions of Feature Vectors
    2.3 ASR Performance in Distant-talking Environments
        2.3.1 The Effect of Additive Noise on Recognition Accuracy
        2.3.2 The Effect of Reverberation on Recognition Accuracy
    2.4 Summary

3 Microphone Array Processing for Speech Recognition
    3.1 Introduction
    3.2 Fundamentals of Array Processing
    3.3 Microphone Array Processing Approaches
        3.3.1 Classical Beamforming
        3.3.2 Adaptive Array Processing
        3.3.3 Additional Microphone Array Processing Methods
    3.4 Speech Recognition Compensation Methods
        3.4.1 CDCN and VTS
        3.4.2 Maximum Likelihood Linear Regression
    3.5 Speech Recognition with Microphone Arrays
        3.5.1 Experimental Setup and Corpora
        3.5.2 Evaluating Performance and Determining Statistical Significance
        3.5.3 Recognition Results
    3.6 Summary

4 Likelihood Maximizing Beamforming
    4.1 Introduction
    4.2 Filter-and-Sum Array Processing
    4.3 Likelihood Maximizing Beamforming (LIMABEAM)
    4.4 Optimizing the State Sequence
    4.5 Optimizing the Array Parameters
        4.5.1 Gaussian State Output Distributions
        4.5.2 Mixture of Gaussians State Output Distributions
        4.5.3 Gradient-based Array Parameter Optimization
    4.6 Evaluating LIMABEAM Using Oracle State Sequences
        4.6.1 Experiments Using Gaussian Distributions
        4.6.2 Experiments Using Mixtures of Gaussians
        4.6.3 Summary of Results Using Oracle LIMABEAM
    4.7 The Calibrated LIMABEAM Algorithm
        4.7.1 Experimental Results Using Calibrated LIMABEAM
        4.7.2 Summary of Results Using Calibrated LIMABEAM
    4.8 The Unsupervised LIMABEAM Algorithm
        4.8.1 Experimental Results Using Unsupervised LIMABEAM
        4.8.2 Summary of Results Using Unsupervised LIMABEAM
    4.9 Analysis of Optimized Array Parameters and the Output Waveform
        4.9.1 The Optimized Filters of the Array
        4.9.2 The Array Output Waveform
    4.10 Other Considerations
        4.10.1 Incorporating Feature Mean Normalization
        4.10.2 Sum-and-Filter Processing
        4.10.3 Combining LIMABEAM with Other Compensation Techniques
        4.10.4 Computational Complexity
    4.11 Summary

5 Subband-Likelihood Maximizing Beamforming
    5.1 Introduction
    5.2 LIMABEAM in Highly Reverberant Environments
    5.3 An Overview of Subband Adaptive Filtering
    5.4 Subband Filtering for Microphone-array-based Speech Recognition
        5.4.1 Incorporating Subband Processing into the ASR Front-End
        5.4.2 Subband Filter-and-Sum Array Processing
    5.5 Subband-Likelihood Maximizing Beamforming (S-LIMABEAM)
        5.5.1 Feature-based Subband Filtering
        5.5.2 Maximum Likelihood Estimation of Subband Filter Parameters
    5.6 Optimizing the Subband Filter Parameters
        5.6.1 Gaussian State Output Distributions
        5.6.2 Mixture of Gaussians State Output Distributions
    5.7 Analysis of the Dimensionality of Subband Filtering
    5.8 Applying S-LIMABEAM to Reverberant Speech
    5.9 Evaluating S-LIMABEAM Using Oracle State Sequences
    5.10 The Calibrated S-LIMABEAM Algorithm
        5.10.1 Experimental Results Using Calibrated S-LIMABEAM
    5.11 The Unsupervised S-LIMABEAM Algorithm
        5.11.1 Experimental Results Using Unsupervised S-LIMABEAM
    5.12 Computational Complexity
    5.13 Dimensionality Reduction via Parameter Sharing
        5.13.1 Sharing Parameters Within Mel Spectral Components
        5.13.2 Sharing Parameters Across Mel Spectral Components
        5.13.3 Experimental Results Using Parameter Sharing
    5.14 S-LIMABEAM in Environments with Low Reverberation
    5.15 Summary

6 LIMABEAM in Other Multi-Microphone Environments
    6.1 Introduction
    6.2 Multi-microphone Speech Recognition on a PDA
    6.3 The CMU WSJ PDA corpus
        6.3.1 Experimental Results Using LIMABEAM