MULTIPLE APPROACHES TO ROBUST SPEECH RECOGNITION

Richard M. Stern, Fu-Hua Liu, Yoshiaki Ohshima, Thomas M. Sullivan, Alejandro Acero*

Department of Electrical and Computer Engineering
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

*Present address: Telefónica Investigación y Desarrollo, Emilio Vargas 6, 28043 Madrid, Spain

ABSTRACT

This paper compares several different approaches to robust speech recognition. We review CMU's ongoing research in the use of acoustical pre-processing to achieve robust speech recognition, and we present the results of the first evaluation of pre-processing in the context of the DARPA standard ATIS domain for spoken language systems. We also describe and compare the effectiveness of three complementary methods of signal processing for robust speech recognition: acoustical pre-processing, microphone array processing, and the use of physiologically-motivated models of peripheral signal processing. Recognition error rates are presented using these three approaches in isolation and in combination with each other for the speaker-independent continuous alphanumeric census speech recognition task.

1. INTRODUCTION

The need for speech recognition systems and spoken language systems to be robust with respect to their acoustical environment has become more widely appreciated in recent years (e.g. [1]). Results of several studies have demonstrated that even automatic speech recognition systems that are designed to be speaker independent can perform very poorly when they are tested using a different type of microphone or acoustical environment from the one with which they were trained (e.g. [2, 3]), even in a relatively quiet office environment. Applications such as speech recognition over telephones, in automobiles, on a factory floor, or outdoors demand an even greater degree of environmental robustness.

The CMU speech group is committed to the development of speech recognition systems that are robust with respect to environmental variation, just as it has been an early proponent of speaker-independent recognition. While most of our work presented to date has described new acoustical pre-processing algorithms (e.g. [2, 4, 5]), we have always regarded pre-processing as one of several approaches that must be developed in concert to achieve robust recognition.

The purpose of this paper is twofold. First, we describe our results for the DARPA benchmark evaluation for robust speech recognition for the ATIS task, discussing the effectiveness of our methods of acoustical pre-processing in the context of this task. Second, we describe and compare the effectiveness of three complementary methods of signal processing for robust speech recognition: acoustical pre-processing, microphone array processing, and the use of physiologically-motivated models of peripheral signal processing.

2. ACOUSTICAL PRE-PROCESSING

We have found that two major factors degrading the performance of speech recognition systems using desktop microphones in normal office environments are additive noise and unknown linear filtering. We showed in [2] that simultaneous joint compensation for the effects of additive noise and linear filtering is needed to achieve maximal robustness with respect to acoustical differences between the training and testing environments of a speech recognition system. We described in [2] two algorithms that can perform such joint compensation, based on additive corrections to the cepstral coefficients of the speech waveform.

The first compensation algorithm, SNR-Dependent Cepstral Normalization (SDCN), applies an additive correction in the cepstral domain that depends exclusively on the instantaneous SNR of the signal. This correction vector equals the average difference in cepstra between simultaneous "stereo" recordings of speech samples from both the training and testing environments at each SNR of speech in the testing environment. At high SNRs, this correction vector primarily compensates for differences in spectral tilt between the training and testing environments (in a manner similar to the blind deconvolution procedure first proposed by Stockham et al. [6]), while at low SNRs the vector provides a form of noise subtraction (in a manner similar to the spectral subtraction algorithm first proposed by Boll [7]). The SDCN algorithm is simple and effective, but it requires environment-specific training.
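The SDCN recipe described above fits in a few lines of code. The sketch below is an illustrative reconstruction, not CMU's implementation: it assumes simultaneously recorded ("stereo") cepstra from the two environments, a per-frame instantaneous SNR estimate for the testing channel, and a fixed set of SNR bins; all function and variable names are hypothetical.

```python
import numpy as np

def sdcn_corrections(cep_train, cep_test, snr_test, n_bins=20):
    """Estimate SNR-dependent correction vectors from stereo recordings.

    cep_train, cep_test: (n_frames, n_cep) cepstra of the same utterances
    recorded in the training and testing environments.
    snr_test: (n_frames,) instantaneous SNR (dB) in the testing environment.
    """
    edges = np.linspace(snr_test.min(), snr_test.max(), n_bins + 1)
    idx = np.clip(np.digitize(snr_test, edges) - 1, 0, n_bins - 1)
    corr = np.zeros((n_bins, cep_train.shape[1]))
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            # average cepstral difference between environments at this SNR
            corr[b] = (cep_train[in_bin] - cep_test[in_bin]).mean(axis=0)
    return edges, corr

def sdcn_compensate(cep, snr, edges, corr):
    """Apply the additive correction selected by each frame's SNR."""
    idx = np.clip(np.digitize(snr, edges) - 1, 0, len(corr) - 1)
    return cep + corr[idx]
```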
The second compensation algorithm, Codeword-Dependent Cepstral Normalization (CDCN), uses EM techniques to compute ML estimates of the parameters characterizing the contributions of additive noise and linear filtering that, when applied in inverse fashion to the cepstra of an incoming utterance, produce an ensemble of cepstral coefficients that best match (in the ML sense) the cepstral coefficients of the incoming speech in the testing environment to the locations of VQ codewords in the training environment. The CDCN algorithm has the advantage that it does not require a priori knowledge of the testing environment (in the form of stereo training data from the training and testing environments), but it is much more computationally demanding than the SDCN algorithm. Compared to the SDCN algorithm, the CDCN algorithm uses a greater amount of structural knowledge about the nature of the degradations to the speech signal in order to achieve good recognition accuracy. The SDCN algorithm, on the other hand, derives its compensation vectors entirely from empirical observations of differences between data obtained from the training and testing environments.
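A full CDCN implementation requires the EM loop, but the degradation model being inverted, and a single posterior-weighted compensation pass with the noise and channel vectors taken as given, can be sketched compactly. The sketch below works in the log-spectral rather than the cepstral domain for brevity; the fixed q and n, the unit variance, and all names are simplifying assumptions, not the published algorithm.

```python
import numpy as np

def degrade(x, q, n):
    """Environment model: clean log spectrum x through channel q plus noise n."""
    return x + q + np.log1p(np.exp(n - x - q))

def cdcn_like_pass(z, codebook, q, n, var=1.0):
    """One MMSE correction of observed frames z toward training-environment
    VQ codewords, given channel q and noise n. CDCN itself estimates q and n
    with ML/EM; that estimation step is omitted here."""
    w = np.log1p(np.exp(n - codebook - q))       # per-codeword correction (K, D)
    x_hat = np.empty_like(z)
    for t, zt in enumerate(z):
        resid = zt - (codebook + q + w)          # mismatch to each degraded codeword
        logp = -0.5 * np.sum(resid ** 2, axis=1) / var
        p = np.exp(logp - logp.max())
        p /= p.sum()                             # posterior over codewords
        x_hat[t] = np.sum(p[:, None] * (zt - q - w), axis=0)
    return x_hat
```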
Figure 1 compares the error rate obtained when the SPHINX system is trained using the DARPA-standard Sennheiser HMD-414 close-talking microphone (CLSTLK) and tested using either the CLSTLK microphone or the omnidirectional desktop Crown PZM-6FS microphone (PZM6FS). The census database was used, which contains simultaneous recordings of speech from the CLSTLK and PZM6FS microphones in the context of a speaker-independent continuous-speech alphanumeric task with perplexity 65 [2]. These results demonstrate the value of the joint compensation provided by the CDCN algorithm, in contrast to the independent compensation using either spectral subtraction or spectral normalization. The horizontal dotted lines indicate the recognition accuracy obtained when the system is tested on the microphone with which it was trained, with no processing. The intersection of the upper curve with the upper horizontal line indicates that with CDCN compensation, SPHINX can recognize speech using the PZM6FS microphone just as well when trained on the CLSTLK microphone as when trained using the PZM6FS.

Figure 1: Comparison of error rates obtained on the census task with no processing, spectral subtraction, spectral normalization, and the CDCN algorithm. SPHINX was trained on the CLSTLK microphone and tested using either the CLSTLK microphone (solid curve) or the PZM6FS microphone (broken curve).

More recently we have been attempting to develop new algorithms which combine the computational simplicity of SDCN with the environmental independence of CDCN. One such algorithm, Blind SNR-Dependent Cepstral Normalization (BSDCN), avoids the need for environment-specific training by establishing a correspondence between SNRs in the training and testing environments by use of traditional nonlinear warping techniques [8] on histograms of SNRs from each of the two environments [5].

Table 1 compares the environmental specificity, computational complexity, and recognition accuracy of these algorithms when evaluated on the alphanumeric database described in [2]. Recognition accuracy is somewhat different from the figures reported in Fig. 1 because the version of SPHINX used to produce these data was different. All of these algorithms are similar in function to other currently-popular compensation strategies (e.g. [3, 9]).

    ALGORITHM    ENV.-SPECIFIC?    COMPLEXITY    ERROR RATE
    NONE         no                none          68.6%
    SDCN         yes               minimal       27.6%
    CDCN         no                greater       24.3%
    BSDCN        no                minimal       30.0%

Table 1: Comparison of recognition accuracy of SPHINX with no processing and with the CDCN, SDCN, and BSDCN algorithms. The system was trained using the CLSTLK microphone and tested using the PZM6FS microphone. Training and testing on the CLSTLK produces a recognition accuracy of 86.9%, while training and testing on the PZM6FS produces 76.2%.
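The correspondence at the heart of BSDCN can be illustrated without stereo data. The stand-in below aligns the two SNR distributions by simple quantile matching rather than the dynamic-programming warping of [8] that BSDCN actually uses, so it should be read as a sketch of the idea, not of the algorithm; the names are hypothetical.

```python
import numpy as np

def snr_correspondence(snr_test, snr_train, n_points=50):
    """Build a monotonic map from testing-environment SNRs to
    training-environment SNRs by aligning their empirical distributions."""
    q = np.linspace(0.0, 1.0, n_points)
    test_q = np.quantile(snr_test, q)
    train_q = np.quantile(snr_train, q)
    # piecewise-linear warping: testing SNR -> equivalent training SNR
    return lambda s: np.interp(s, test_q, train_q)
```

With such a map, correction vectors indexed by training-environment SNR can be selected for each testing-environment frame without ever collecting stereo recordings.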
The DARPA ATIS robust speech evaluation. The original CDCN algorithm described in [2] was used for the February, 1992, ATIS-domain robust-speech evaluation. For this evaluation, the SPHINX-II system was trained using the CLSTLK microphone and tested using both the CLSTLK microphone and the unidirectional Crown PCC-160 microphone (PCC160). All incoming speech in this evaluation was processed by the CDCN algorithm, regardless of whether the testing environment was actually the CLSTLK or PCC160 microphone, and the CDCN algorithm was not provided with explicit knowledge of the identity of the environment within which it was operating.

As described elsewhere in these Proceedings [10], the system used for the official robust-speech evaluations was not trained as thoroughly as the baseline system was trained. Specifically, the official evaluations were performed after only a single iteration through training data that was processed with the CDCN algorithm, and without the benefit of general English sentences in the training database.

In Fig. 2 we show the results of an unofficial evaluation of the SPHINX-II system that was performed immediately after the official evaluation was complete. The purpose of this second evaluation was to measure the improvement provided by an additional round of training with speech processed by CDCN, in order to be able to directly compare error rates on the ATIS task with CDCN with those produced by a comparably-trained system on the same data, but without CDCN. As Fig. 2 shows, using the CDCN algorithm causes the error rate to increase from 15.1% to only 20.4% as the testing microphone is changed from the CLSTLK to the PCC160 microphone. In contrast, the error rate increases from 12.2% to 38.8% when one switches from the CLSTLK to the PCC160 microphone without CDCN.

Figure 2: Comparison of error rates obtained on the DARPA ATIS task with and without CDCN processing. SPHINX-II was trained on the CLSTLK microphone in all cases, and tested using either the CLSTLK microphone (solid curve) or the cardioid desktop Crown PCC160 microphone (broken curve).

Only two sites submitted data for the present robust speech evaluation. CMU's percentage degradation in error rate in changing from the CLSTLK to the PCC160 environment, as well as the absolute error rate obtained using the PCC160 microphone, were the better of the results from these two sites.

3. MICROPHONE ARRAYS AND ACOUSTICAL PRE-PROCESSING

Despite the encouraging results that we have achieved using acoustical pre-processing, we believe that further improvements in recognition accuracy can be obtained in difficult environments by combining acoustical pre-processing with other complementary types of signal processing. The use of microphone arrays is motivated by a desire to improve the effective SNR of speech as it is input to the recognition system. For example, the headset-mounted CLSTLK microphone produces a higher SNR than the PZM6FS microphone under normal circumstances because it picks up a relatively small amount of additive noise, and the incoming signal is not degraded by reverberated components of the original speech.

To estimate the potential significance of the reduced SNR provided by the PZM6FS microphone in the office environment, we manually examined all utterances in the test set of the census task that were recognized correctly when training and testing with the CLSTLK microphone but that were recognized incorrectly when training and testing using the PZM6FS. We found that 54.7 percent of these errors were caused by the confusion of silence or noise segments with weak phonetic events, 20 percent of the errors were caused by cross-talk from other noise sources in the room, and the remaining errors could not be attributed to a particular cause. Microphone arrays can, in principle, produce directionally-sensitive gain patterns that can be adjusted to produce maximal sensitivity in the direction of the speaker and reduced sensitivity in the direction of competing sound sources. To the extent that such processing could improve the effective SNR at the input to a speech recognition system, the error rate would be likely to be substantially decreased, because the number of confusions between weak phonetic events and noise would be sharply reduced.

Several different types of array-processing strategies have been applied to automatic speech recognition. The simplest approach is that of the delay-and-sum beamformer, in which delays are inserted in each channel to compensate for differences in travel time between the desired sound source and the various sensors (e.g. [11, 12]), as sketched below. A second option is to use an adaptation algorithm based on minimizing mean square energy, such as the Frost or Griffiths-Jim algorithm [13].
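A delay-and-sum beamformer is short enough to show in full. This sketch assumes a known source position and integer-sample delays (real arrays use fractional-delay filters); it is a generic textbook version with hypothetical names, not the Flanagan array discussed later in this section.

```python
import numpy as np

def delay_and_sum(channels, mic_positions, source_pos, fs, c=343.0):
    """Time-align the sensor signals for a given source position and average.

    channels: (n_mics, n_samples) microphone signals sampled at fs.
    mic_positions, source_pos: coordinates in meters; c is the speed of sound.
    """
    dists = np.linalg.norm(mic_positions - source_pos, axis=1)
    delays = (dists - dists.min()) / c              # relative delay, seconds
    shifts = np.round(delays * fs).astype(int)      # integer-sample delays
    n = channels.shape[1] - shifts.max()
    aligned = np.stack([ch[s:s + n] for ch, s in zip(channels, shifts)])
    return aligned.mean(axis=0)                     # coherent sum of the speech
```

Signals arriving from the steered direction add coherently while sound from other directions adds incoherently, which is the source of the SNR gain referred to above.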
Adaptive algorithms of this kind provide the opportunity to develop nulls in the direction of noise sources as well as more sharply focused beam patterns, but they assume that the desired signal is statistically independent of all sources of degradation. Consequently, these algorithms can provide good improvement in SNR when signal degradations are caused by additive independent noise sources, but they do not perform well in reverberant environments when the distortion is at least in part a delayed version of the desired speech signal [14, 15]. (This problem can be avoided by adapting only during non-speech segments [16].)

A third type of approach to microphone array processing is to use a cross-correlation-based algorithm that isolates inter-sensor differences in arrival time of the signals directly (e.g. [17]). These algorithms are appealing because they are based on human binaural hearing, and cross-correlation is an efficient way to identify the direction of a strong signal source. Nevertheless, the nonlinear nature of the cross-correlation operation renders it inappropriate as a means to directly process waveforms. We believe that signal processing techniques based on human binaural perception are worth pursuing, but their effectiveness for automatic speech recognition remains to be conclusively demonstrated.
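The inter-sensor arrival-time estimate that such cross-correlation methods rely on is itself straightforward to compute. A minimal sketch, assuming two equal-length signals and a bounded physical lag (names hypothetical):

```python
import numpy as np

def tdoa(x, y, fs, max_lag_s=1e-3):
    """Estimate the arrival-time difference between two microphone signals
    from the peak of their cross-correlation over plausible lags."""
    max_lag = int(max_lag_s * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    xc = [np.dot(x[max(0, -l):len(x) - max(0, l)],
                 y[max(0, l):len(y) - max(0, -l)]) for l in lags]
    return lags[int(np.argmax(xc))] / fs   # positive when y lags x (seconds)
```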
Pilot evaluation of the Flanagan array. In order to obtain a better understanding of the ability of array processing to provide further improvements in recognition accuracy, we conducted a pilot evaluation of the 23-microphone array developed by Flanagan and his colleagues at AT&T Bell Laboratories. The Flanagan array, which is described in detail in [11, 12], is a one-dimensional delay-and-sum beamformer which uses 23 microphones that are unevenly spaced in order to provide a beamwidth that is approximately constant over the range of frequencies of interest. The array uses first-order gradient microphones, which develop a null response in the vertical plane. We wished to compare the recognition accuracy on the census task obtained using the Flanagan array with the accuracy observed using the CLSTLK and PZM6FS microphones. We were especially interested in determining the extent to which array processing provides an improvement in recognition accuracy that is complementary to the improvement in accuracy provided by acoustical pre-processing algorithms such as the CDCN algorithm.

41 utterances from the census database were obtained from each of five male speakers in a sparsely-furnished laboratory at the Rutgers CAIP Center with hard walls and floors. The reverberation time of this room was informally estimated to be between 500 and 750 ms. Simultaneous recordings were made of each utterance using three microphones: the Sennheiser HMD-414 (CLSTLK) microphone, the Crown PZM6FS, and the Flanagan array with input lowpass-filtered at 8 kHz. Recordings were made with the speaker seated at distances of 1, 2, and 3 meters from the PZM6FS and Flanagan array microphones, wearing the CLSTLK microphone in the usual fashion at all times.

Figure 3 summarizes the error rates obtained from these speech samples at two distances, 1 and 3 meters, with and without the CDCN algorithm applied to the output of the microphone array. Error rates using the CLSTLK microphone differed somewhat for the two distances because different speech samples were obtained at each distance and because the sample size is small. The SPHINX system had been previously trained on speech obtained using the CLSTLK microphone. As expected, the worst results were obtained using the PZM6FS microphone, while the lowest error rate was obtained for speech recorded using the CLSTLK. More interestingly, the results in Fig. 3 show that both the Flanagan array and the CDCN algorithm are effective in reducing the error rate, and that in fact the error rate at each distance obtained with the combination of the two is very close to the error rate obtained with the CLSTLK microphone and no acoustical pre-processing. The complementary nature of the improvement of the Flanagan array and the CDCN algorithm is indicated by the fact that adding CDCN to the array improves the error rate (upper panel of Fig. 3), and that converting to the array even when CDCN is already employed also improves performance (lower panel).

Figure 3: Comparison of recognition accuracy obtained on a portion of the census task using the omnidirectional Crown PZM-6FS, the 23-microphone array developed by Flanagan, and the Sennheiser microphone, each with and without CDCN. Data were obtained from simultaneous recordings using the three microphones at distances of 1 and 3 meters (for the PZM-6FS and the array).

4. PHYSIOLOGICALLY-MOTIVATED FRONT ENDS AND ACOUSTICAL PRE-PROCESSING

In recent years there has also been an increased interest in the use of peripheral signal processing schemes that are motivated by human auditory physiology and perception, and a number of such schemes have been proposed (e.g. [18, 19, 20, 21]). Recent evaluations indicate that with "clean" speech, such approaches tend to provide recognition accuracy that is comparable to that obtained with conventional LPC-based or DFT-based signal processing schemes, but that these auditory models can provide greater robustness with respect to environmental changes when the quality of the incoming speech (or the extent to which it resembles speech used in training the system) decreases [22, 23]. Despite the apparent utility of such processing schemes, no one has a deep-level understanding of why they work as well as they do, and in fact different researchers choose to emphasize rather different aspects of the peripheral auditory system's response to sound in their work.
Most auditory models include a set of linear bandpass filters with bandwidth that increases nonlinearly with center frequency, a nonlinear rectification stage that frequently includes short-term adaptation and lateral suppression, and, in some cases, a more central display based on short-term temporal information (a generic sketch of these stages appears below). We estimate that the number of arithmetic operations of some of the currently-popular auditory models ranges from 53 to 600 times the number of operations required for the LPC-based processing used in SPHINX-II.

Pilot evaluation of the Seneff auditory model. We recently completed a series of pilot evaluations using an implementation of the Seneff auditory model [21] on the census database. Since almost all evaluations of physiologically-motivated front ends to date have been performed using artificially-added white Gaussian noise, we have been interested in the extent to which auditory models can provide useful improvements in recognition accuracy for speech that has been degraded by reverberation or other types of linear filtering. As in the case of microphone arrays, we are also especially interested in determining the extent to which improvements in robustness provided by auditory modelling complement those that we already enjoy by the use of acoustical pre-processing algorithms such as CDCN.

We compared error rates obtained using the standard 12 LPC-based cepstral coefficients normally input to the SPHINX system with those obtained using an implementation of the 40-channel mean-rate output of the Seneff model [21], and with the 40-channel outputs of Seneff's Generalized Synchrony Detectors (GSDs).
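The generic stages listed above (a filter bank whose bandwidths grow with center frequency, rectification, and short-term adaptation) can be sketched as follows. This is a deliberately crude toy, not Seneff's model [21] or any other published front end; the filter orders, bandwidth rule, and adaptation constant are arbitrary assumptions.

```python
import numpy as np
from scipy.signal import butter, lfilter

def auditory_front_end(x, fs, n_channels=40, fmin=100.0):
    """Toy auditory-style analysis: log-spaced bandpass channels,
    half-wave rectification, and a slow AGC-like adaptation stage."""
    centers = np.geomspace(fmin, 0.45 * fs, n_channels)
    outputs = []
    for fc in centers:
        bw = 0.2 * fc                                  # bandwidth grows with fc
        lo = max(fc - bw / 2, 1.0) / (fs / 2)
        hi = min(fc + bw / 2, 0.49 * fs) / (fs / 2)
        b, a = butter(2, [lo, hi], btype="band")
        band = lfilter(b, a, x)
        rect = np.maximum(band, 0.0)                   # half-wave rectification
        env = lfilter([0.001], [1.0, -0.999], rect) + 1e-6  # slow envelope
        outputs.append(rect / env)                     # short-term adaptation
    return np.stack(outputs)                           # (n_channels, n_samples)
```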
The system the combination of conventional LPC-based processing was evaluated using the original testing database from the and the CDCN algorithm produced performance that census task with the CLSTLK and PZM6FS microphones, equaled or bettered the best performance obtained with the and also with white Ganssian noise artificially added at auditory model for each test condition. Because the sig.nal-to-noise ratios of +10, +20, and +30 dB, measured auditory model is nonlinear and not easy to port from one using the global SNR method described in 19. site to another, these comparisons should all be regarded as preliminary. It is quite possible that performance using the 100 Z~ auditory model could further improve if greater attention ... 80 J." "~ LPC were paid to tuning it to more closely match the charac- I~ ~ , ~"'~' LPC+CDCN teristics of .XNIHPS 60 ,me Mean Rate %% : : GSD We also attempted to determine the extent to which a com- 4o bination of auditory processing and the CDCN algorithm 2O could provide greater recognition accuracy than either processing scheme used in isolation. In these experiments 0 lO'dB 20'dB 30'dB CI;an we combined the effects of CDCN and auditory processing SNR by resynthesizing the speech waveform from cepstral coef- ficients that were produced by the original LPC front end ,~. 100 and then modified by the CDCN algorithm. The resyn- @ os thesized speech, which was totally intelligible, was then passed through the Seneff auditory model in the usual fashion. Unfortunately, it was found that this particular 4O k- ~ CPL combination of CDCN and the auditory model did not im- prove the recognition error etaar beyond the level achieved 20 : = Mean Rate by CDCN alone. A subsequent error analysis revealed that - - GSD this concatenation of cepstral processing and the CDCN 10 ' dB 20 ' dB 30'dB Clean ' algorithm, followed by resynthesis and processing by the SNR original XNIHPS front end, degraded the error rates even in Figure 4: Pilot data comparing error rates obtained on the the absence of the auditory processing, although analysis census task using the conventional LPC-based processing and resynthesis without the CDCN algorithm did not of SPHINX with results obtained using the mean rate and produce much degradation. This indicates that useful in- synchrony outputs of the Seneff auditory model. XNIHPS formation for speech recognition is lost when the resyn- was trained on the CLSTLK microphone in all cases, and thesis process is performed after the CDCN algorithm is run. Hence we regard this experiment as inconclusive, and tested using either the CLSTLK microphone (upper panel) we intend to explore other types of combinations of acous- or the Crown PZM6FS microphone (lower panel). White tical pre-processing with auditory modelling in the future. noise was artificially added to the speech signals and data are plotted as a function of global SNR. 5. SUMMARY AND CONCLUSIONS Figure 4 summarizes the results of these comparisons, with error rate plotted as a function of SNR using each of the In this paper we describe our current research in acoustical three peripheral signal processing schemes. The upper pre-processing for robust speech recognition, as well as panel describes recognition error rates obtained with the our first attempts to integrate pre-processing with other system both trained and tested using the CLSTLK approaches to robust speech recognition. 
5. SUMMARY AND CONCLUSIONS

In this paper we describe our current research in acoustical pre-processing for robust speech recognition, as well as our first attempts to integrate pre-processing with other approaches to robust speech recognition. The CDCN algorithm was also applied to the ATIS task for the first time, and provided the best recognition scores for speech collected using the unidirectional desktop PCC160 microphone. We demonstrated that the CDCN algorithm and the Flanagan delay-and-sum microphone array can provide complementary benefits to speech recognition in reverberant environments. We also found that the Seneff auditory model improves recognition accuracy of the CMU speech system in reverberant as well as noisy environments, but preliminary efforts to combine the auditory model with the CDCN algorithm were inconclusive.

ACKNOWLEDGMENTS

This research is sponsored by the Defense Advanced Research Projects Agency, DoD, through ARPA Order 7239, and monitored by the Space and Naval Warfare Systems Command under contract N00039-91-C-0158. Views and conclusions contained in this document are those of the authors and should not be interpreted as representing official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or of the United States Government. We thank Hsiao-Wuen Hon, Xuedong Huang, Kai-Fu Lee, Raj Reddy, Eric Thayer, Bob Weide, and the rest of the speech group for their contributions to this work. We also thank Jim Flanagan, Joe French, and A. C. Surendran for their assistance in obtaining the experimental data using the array microphone, and Stephanie Seneff for providing source code for her auditory model. The graduate studies of Tom Sullivan and Yoshiaki Ohshima have been supported by Motorola and IBM Japan, respectively.

REFERENCES

1. Juang, B. H., "Speech Recognition in Adverse Environments", Comp. Speech and Lang., Vol. 5, 1991, pp. 275-294.
2. Acero, A. and Stern, R. M., "Environmental Robustness in Automatic Speech Recognition", ICASSP-90, April 1990, pp. 849-852.
3. Erell, A. and Weintraub, M., "Estimation Using Log-Spectral-Distance Criterion for Noise-Robust Speech Recognition", ICASSP-90, April 1990, pp. 853-856.
4. Acero, A. and Stern, R. M., "Robust Speech Recognition by Normalization of the Acoustic Space", ICASSP-91, May 1991, pp. 893-896.
5. Liu, F.-H., Acero, A., and Stern, R. M., "Efficient Joint Compensation of Speech for the Effects of Additive Noise and Linear Filtering", IEEE International Conference on Acoustics, Speech, and Signal Processing, March 1992.
6. Stockham, T. G., Cannon, T. M., and Ingebretsen, R. B., "Blind Deconvolution Through Digital Signal Processing", Proc. IEEE, Vol. 63, 1975, pp. 678-692.
7. Boll, S. F., "Suppression of Acoustic Noise in Speech Using Spectral Subtraction", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 27, 1979, pp. 113-120.
8. Sakoe, H., Chiba, S., "Dynamic Programming Algorithm Optimization for Spoken Word Recognition", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 26, 1978, pp. 43-49.
9. Hermansky, H., Morgan, N., Bayya, A., Kohn, P., "Compensation for the Effect of the Communication Channel in Auditory-Like Analysis of Speech (RASTA-PLP)", Proc. of the Second European Conf. on Speech Comm. and Tech., September 1991.
M., "Environmental Robusmess in Automatic Speech Recognition", ICASSP-90, April .71 Lyon, R. F., "A Computational Model of Binaural ,0991 pp. 849-852. Localization and Separation", IEEE International Con- ference on Acoustics, Speech, and Signal Processing, 3. Erell, A. and Weintraub, M., "Estimation Using Log- ,3891 pp. 1148-1151. Spectral-Distance Criterion for Noise-Robust Speech Recognition", ICASSP-90, April 1990, pp. 853-856. .81 Cohen, J. R., "Application of an Auditory Model to Speech Recognition", The Journal of the Acoustical 4. Acero, A. and Stem, R. M., "Robust Speech Recognition Society of America, Vol. ,58 No. ,6 June 1989, pp. by Normalization of the Acoustic Space", ICASSP-91, 2623 -2629. May 1991, pp. 893-896. .91 Ghitza, O., "Auditory Nerve Representation as a Front- 5. Liu, F.-H., Acero, A., and Stern, R. M., "Efficient Joint End for Speech Recognition in a Noisy Environment", Compensation of Speech for the Effects of Additive Noise .paroC Speech and Lang., Vol. ,1 1986, pp. 109-130. and Linear Filtering", IEEE International Conference on Acoustics, Speech, and Signal Processing, March 1992. 20. Lyon, R. F., "A Computational Model of Filtering, Detection, and Compression in the Cochlea", IEEE Inter- .6 Stockham, T. G., Cannon, T. M,. and lngebretsen, R. B., national Conference on Acoustics, Speech, and Signal "Blind Deconvolution Through Digital Signal Process- Processing, May 1982, pp. 1282-1285. ing", Proc. IEEE, Vol. ,36 1975, pp. 678-692. 21. Seneff, S., "A Joint SynchronyMean-Rate Model of 7. Boll, .S F. , "Suppression of Acoustic Noise in Speech Auditory Speech Processing", Journal of Phonetics, Vol. Using Spectral Subtraction", ASSP, Vol. 27, 1979, pp. ,61 No. ,1 January 1988, pp. 55-76. .021-311 22. Hunt, M., "A Comparison of Several Acoustic Represen- 8. Sakoe, H., Chiba, S., "Dynamic Programming Algorithm tations for Speech Recognition with Degraded and Un- Optimization for Spoken Word Recognition", IEEE degraded Speech", IEEE International Conference on Transactions on Acoustics, Speech, and Signal Acoustics, Speech, and Signal Processing, May 1989. Processing, Vol. 26, 1978, pp. 43-49. 23. Meng, H., and Zue, V. W., "A Comparative Study of 9. Hermansky, H., Morgan, N., Bayya, A., Kohn, P., "Com- Acoustic Representations of Speech for Vowel Classifica- pensation for the Effect of the Communication Channel in tion Using Multi-Layer Perceptrons", Int. Conf. on Auditory-Like Analysis of Speech (RASTA-PLP)", Proc. Spoken Lang. Processing, November 1990, pp. of the Second European Confi on Speech Comm. and .6501-3501 Tech., September 1991. 279
