2014 IEEE XXXIV International Scientific Conference Electronics and Nanotechnology (ELNANO)

On existence of optimal boundary value between early reflections and late reverberation

Arkadiy Prodeus, Olga Ladoshko
Acoustic and Electroacoustic Department, Faculty of Electronics, NTUU KPI, Kyiv, Ukraine

Abstract: Enhancement of speech distorted by reverberation is a topical problem that has been actively studied over the last decade. However, it is still extremely difficult to find clear recommendations on the choice of the boundary between early reflections and late reverberation that is optimal with respect to criteria such as speech recognition accuracy and speech quality. Another problem is obtaining a simple pre-processor for speech dereverberation. Both problems are investigated in this paper.

Keywords: speech enhancement; late reverberation; dereverberation

I. INTRODUCTION

The problem of speech dereverberation in communication and automatic speech recognition (ASR) systems is a topical one [1-4]. It has been investigated especially actively in the last decade owing to the rapid development of mobile communications. It was found that late reverberation is the main detrimental factor and may be interpreted as a kind of noise. Unfortunately, the strong non-stationarity of late reverberation makes traditional noise suppression techniques ineffective [1], because these techniques are designed for stationary or slowly varying noise.

At the same time, it was found that the late reverberation power spectrum can be estimated relatively easily when Polack's statistical reverberation model is adopted [2]. The estimation formula is simple both to compute and to understand. However, it contains the parameter $T_l$, the time corresponding to the boundary between early reflections and late reverberation. The problem is that this boundary is blurred: $T_l \approx 30$ to $60$ ms is given in [2] and 40 to 100 ms in [3]. Moreover, these values were obtained experimentally in studies of speech intelligibility and musical clarity [5], and it is not evident that the same values are suitable for speech recognition and communication systems.

The objective of this paper is to investigate whether optimal values of $T_l$ exist with respect to criteria such as speech recognition accuracy and speech quality. Another objective is to evaluate the performance of speech recognition and communication systems when a relatively simple speech dereverberation pre-processor, proposed by the authors, is used.

II. REVERBERATION MODEL

The reverberant signal $y(t)$ results from the convolution of the anechoic speech signal $x(t)$ and the causal time-invariant acoustic impulse response (AIR) $h(t)$:

$$ y(t) = \int_0^{\infty} h(v)\,x(t-v)\,dv = x(t) \otimes h(t), $$

where $\otimes$ denotes convolution.

Selecting in the AIR $h(t)$ (Fig. 1) the regions corresponding to early reflections and late reverberation,

$$ h_i(t) = \begin{cases} h(t), & 0 \le t \le T_l; \\ 0, & \text{otherwise}, \end{cases} \qquad h_l(t) = \begin{cases} h(t+T_l), & t \ge 0; \\ 0, & \text{otherwise}, \end{cases} $$

the action of reverberation can be described as

$$ y(t) = h_i(t) \otimes x(t) + h_l(t) \otimes x(t-T_l) = h_i(t) \otimes x(t) + r(t), \tag{1} $$

where $r(t)$ is the component due to late reverberation and $T_l$ is the time corresponding to the boundary between early reflections and late reverberation (see Fig. 1).

Comparing model (1) with the additive noise model

$$ y(t) = x(t) + n(t), $$

where $n(t)$ is a stationary stochastic process, makes it clear why late reverberation may be interpreted as a kind of noise. Unfortunately, the strong non-stationarity of late reverberation makes traditional noise suppression techniques ineffective [1], because these techniques are designed for stationary or slowly varying noise.
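As a rough illustration of model (1), the sketch below splits a sampled AIR at the boundary $T_l$ and checks that the early and late contributions add up to the full reverberant signal. It is a discrete-time toy example and not part of the paper: the function names `convolve` and `split_air`, the sample values and the sampling rate are assumptions chosen for readability, and the boundary sample is assigned to the late part.

```python
def convolve(h, x):
    """Full discrete convolution of two sequences (plain Python lists)."""
    y = [0.0] * (len(h) + len(x) - 1)
    for i, hi in enumerate(h):
        for j, xj in enumerate(x):
            y[i + j] += hi * xj
    return y


def split_air(h, t_l, fs):
    """Split a sampled AIR into early and late parts at t_l seconds.

    Returns (h_early, h_late, n_l): h_early holds the first n_l samples,
    h_late holds the remainder, re-indexed to start at zero as in (1).
    """
    n_l = int(round(t_l * fs))
    return h[:n_l], h[n_l:], n_l


# Hypothetical toy data: an 8-tap decaying "AIR" and a short "speech" burst.
fs = 1000                                   # assumed sampling rate, Hz
h = [1.0, 0.6, -0.4, 0.3, 0.2, -0.1, 0.05, 0.02]
x = [1.0, -0.5, 0.25, 0.0, 0.5]

h_early, h_late, n_l = split_air(h, t_l=0.003, fs=fs)

y = convolve(h, x)                          # full reverberant signal
early = convolve(h_early, x)                # h_i * x
late = convolve(h_late, x)                  # h_l * x, not yet delayed

# r(t) = h_l(t) * x(t - T_l): delay the late part by n_l samples.
r = [0.0] * n_l + late

# Pad the early part to the full length and add the two components.
early_padded = early + [0.0] * (len(y) - len(early))
y_model = [e + late_part for e, late_part in zip(early_padded, r)]

max_error = max(abs(a - b) for a, b in zip(y, y_model))
print(f"n_l = {n_l} samples, max |y - (early + r)| = {max_error:.2e}")
```

The residual printed at the end should be at the level of floating-point rounding, which is the discrete counterpart of the identity in (1).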
The influence of early reflections, described by the convolution of the signal $x(t)$ with the AIR $h_i(t)$, can be compensated in ASR systems by standard techniques such as cepstral mean normalization [2].

Fig. 1. Room AIR structure

III. PROPOSED PRE-PROCESSOR OF DEREVERBERATION

Let us show that late reverberation suppression can be implemented by almost the same means that are usually used for noise suppression. The only difference is that the late reverberation spectrum is estimated instead of the noise spectrum. To modify a noise suppression scheme built according to (2)-(4) below, the estimate of the late reverberation spectrum $\lambda_r(l,k)$ is simply substituted for the estimate of the noise spectrum $\lambda_n(l,k)$ (the corresponding block is outlined in bold in Fig. 2).

For distances between the speech source and the microphone greater than the critical distance $D_c$, the late reverberation power spectrum $\lambda_r(l,k)$ can be calculated from the power spectrum $\lambda_y(l,k)$ of the signal $y(t)$ [2]:

$$ \lambda_r(l,k) = e^{-2\delta(k)T_l}\,\lambda_y(l-N_l,k), \tag{5} $$

where $N_l = T_l F_s / R$; $F_s$ is the sampling frequency; $R$ denotes the frame rate of the short-time Fourier transform (STFT) in samples, i.e. the shift between successive frames; $\delta(k) = 3\ln 10 / T_{60}(k)$ is the decay rate, chosen so that the power envelope $e^{-2\delta(k)t}$ falls by 60 dB at $t = T_{60}(k)$; and $T_{60}(k)$ is the reverberation time.

The meaning of (5) is quite simple: the current speech sounds are masked by the previous speech sounds.
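Equation (5) can be sketched in a few lines. The example below is only an illustration under assumptions that are not in the paper: the function name `late_reverb_psd`, the list-of-frames layout, the frame shift of 256 samples and the toy spectrum values are hypothetical; frames before the start of the signal are treated as silence; $N_l$ is rounded to the nearest integer; and the decay rate is taken as $\delta = 3\ln 10 / T_{60}$, the usual value for Polack's model.

```python
import math


def late_reverb_psd(lambda_y_history, t60, t_l, fs, hop):
    """Estimate the late reverberation PSD for the current frame, as in (5).

    lambda_y_history: past PSD frames of y(t), oldest first, current frame
                      last; each frame is a list over frequency bins k.
    t60:              reverberation time per bin, seconds (list over k).
    t_l:              early/late boundary, seconds.
    fs, hop:          sampling frequency (Hz) and frame shift R (samples).
    """
    n_l = int(round(t_l * fs / hop))            # delay in frames, N_l
    n_bins = len(t60)
    if n_l >= len(lambda_y_history):
        # Assumption: there was silence before the first stored frame.
        return [0.0] * n_bins
    delayed = lambda_y_history[-1 - n_l]        # lambda_y(l - N_l, k)
    psd = []
    for k in range(n_bins):
        delta = 3.0 * math.log(10.0) / t60[k]   # decay rate delta(k)
        psd.append(math.exp(-2.0 * delta * t_l) * delayed[k])
    return psd


# Toy usage: six frames, two bins, frame l holds the value l + 1 in every bin.
history = [[float(l + 1)] * 2 for l in range(6)]
psd = late_reverb_psd(history, t60=[1.1, 1.1], t_l=0.048, fs=22050, hop=256)

# With these settings N_l = 4, so the delayed frame is history[-5].
attenuation_db = 10.0 * math.log10(psd[0] / history[-5][0])
print(f"attenuation applied to the delayed frame: {attenuation_db:.2f} dB")
```

The attenuation equals $-60\,T_l/T_{60}$ dB, which makes the role of $T_l$ explicit: a larger boundary both delays the reference frame and scales it down further.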
Spectral-domain correction is one of the most widespread approaches to noise suppression [1, 2]:

$$ \hat{\lambda}_x^{1/2}(l,k) = G(l,k)\,\lambda_y^{1/2}(l,k), \tag{2} $$

where $\lambda_y(l,k)$ is the power spectrum of the $l$-th frame of the signal $y(t)$ at frequency $f_k = kF_s/N_{fft}$; $F_s$ is the sampling frequency; $N_{fft}$ is the FFT size; $k$ is the frequency bin index; $\hat{\lambda}_x(l,k)$ is the estimate of the power spectrum of the $l$-th frame of the signal $x(t)$ in the $k$-th frequency bin; $G(l,k)$ is the gain of the correction filter for the $l$-th frame of $y(t)$ in the $k$-th frequency bin.

Without loss of generality, consider for definiteness the logMMSE method [6], for which the enhancement filter gain is

$$ G(l,k) = \frac{\xi(l,k)}{1+\xi(l,k)} \exp\left( \frac{1}{2} \int_{v(l,k)}^{\infty} \frac{e^{-t}}{t}\,dt \right), \tag{3} $$

$$ v(l,k) = \frac{\xi(l,k)}{1+\xi(l,k)}\,\gamma(l,k), \tag{4} $$

where $\xi(l,k) = \lambda_x(l,k)/\lambda_n(l,k)$ is the a priori signal-to-noise ratio (SNR); $\gamma(l,k) = \lambda_y(l,k)/\lambda_n(l,k)$ is the a posteriori SNR; $\lambda_n(l,k)$ is the power spectrum of the $l$-th frame of the noise $n(t)$ at frequency $f_k$.

Fig. 2. Proposed pre-processor of dereverberation. Flowchart steps: begin; select the method parameters; split the signal into frames; calculate the power spectrum in each frame; calculate the late reverberation power spectrum in each frame (instead of the noise power spectrum); calculate the gain of the enhancement filter; calculate the spectrum of the enhanced signal; calculate the IFFT of the spectrum in each frame; merge the frames in the time domain; end.

Smoothing is necessary to improve the accuracy of the estimate of the spectrum $\lambda_y(l,k)$ [2]:

$$ \hat{\lambda}_y(l,k) = \eta_y(k)\,\hat{\lambda}_y(l-1,k) + \left(1-\eta_y(k)\right)\left|Y(l,k)\right|^2, \tag{6} $$

where $Y(l,k)$ is the discrete Fourier transform (DFT) of the $l$-th frame of the signal $y(t)$, and

$$ \eta_y(k) = \begin{cases} \eta_y^d(k), & \left|Y(l,k)\right|^2 \le \hat{\lambda}_y(l-1,k); \\ \eta_y^a(k), & \text{otherwise}. \end{cases} \tag{7} $$
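The per-bin operations in (3), (4), (6) and (7) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function names `exp_integral_e1`, `logmmse_gain` and `smooth_psd` are hypothetical, the exponential integral is evaluated with a textbook power series and continued fraction rather than a library call, and the SNR values and smoothing constants in the usage lines are arbitrary.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp_integral_e1(v, eps=1e-12, max_iter=200):
    """E1(v) = integral from v to infinity of exp(-t)/t dt, for v > 0."""
    if v <= 0.0:
        raise ValueError("v must be positive")
    if v <= 1.0:
        # Power series: E1(v) = -gamma - ln(v) - sum_{n>=1} (-v)^n / (n * n!)
        total = -EULER_GAMMA - math.log(v)
        term = 1.0
        for n in range(1, max_iter):
            term *= -v / n                  # term is now (-v)^n / n!
            total -= term / n
            if abs(term / n) < eps:
                break
        return total
    # Continued fraction evaluated with the modified Lentz algorithm.
    b = v + 1.0
    c = 1.0e300
    d = 1.0 / b
    h = d
    for i in range(1, max_iter):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * math.exp(-v)


def logmmse_gain(xi, gamma):
    """Gain (3) with v from (4); xi is the a priori, gamma the a posteriori SNR."""
    v = xi / (1.0 + xi) * gamma
    return xi / (1.0 + xi) * math.exp(0.5 * exp_integral_e1(v))


def smooth_psd(prev_psd, y_power, eta_d, eta_a):
    """One step of (6), with the smoothing constant chosen according to (7)."""
    eta = eta_d if y_power <= prev_psd else eta_a
    return eta * prev_psd + (1.0 - eta) * y_power


# Arbitrary example values (assumptions, not taken from the paper).
gain = logmmse_gain(xi=1.0, gamma=2.0)      # v = 1 for these SNR values
decay_step = smooth_psd(prev_psd=1.0, y_power=0.0, eta_d=0.9, eta_a=0.45)
attack_step = smooth_psd(prev_psd=1.0, y_power=3.0, eta_d=0.9, eta_a=0.45)
print(gain, decay_step, attack_step)
```

In the proposed scheme the quantity playing the role of $\lambda_n(l,k)$ in the two SNRs would be the late reverberation estimate from (5). The two smoothing constants make the estimate follow rising energy quickly and falling energy slowly.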
The upper bound of the constant $\eta_y^d(k)$ in (7), where $0 \le \eta_y^d(k) < 1$, is

$$ \eta_y^d(k) = \frac{1}{1 + 2\delta(k)R/F_s}, \tag{8} $$

and the constant $\eta_y^a(k)$ is selected from the condition $0 \le \eta_y^a(k) < \eta_y^d(k)$.

Two subtasks are fundamentally important and difficult when implementing the logMMSE method:

• estimation of the noise spectrum $\lambda_n(l,k)$;
• estimation of the a priori SNR $\xi(l,k)$.

IV. SIMULATION EXAMPLES

The VoiceBox [7] routine "ssubmmse.m", designed for noise reduction, was modified in accordance with the proposals of the previous section. The reverberation time was estimated by applying Schroeder's method [2] to bandpass-filtered versions of the AIRs. In addition, $\eta_y^a(k) = 0.5\,\eta_y^d(k)$ was used.

A. Qualitative evaluation of dereverberation performance

A real speech signal was recorded in a room with a volume of 80 m³ and a reverberation time of 1.1 s (the AIR is shown in Fig. 3). The recording was digitized at a sampling frequency of 22050 Hz with 16-bit linear quantization. The distance between the speaker and the microphone was about 2 m, which is much greater than the critical distance $D_c \approx 0.5$ m (the value of $D_c$ was calculated by (3.1) from [2]).

Fig. 3. Room AIR, T20 = 1.1 s

Waveforms of the reverberant and enhanced signals are shown in Fig. 4, and the corresponding spectrograms in Fig. 5. On listening, the distorted signal sounds strongly reverberant, whereas the reconstructed signal is much less so, i.e. the positive effect of reverberation suppression is evident. However, a slight distortion introduced by the dereverberation procedure is audible ($T_l = 48$ ms was used in the procedure). Increasing $T_l$ to 100 ms led to some improvement in the sound quality of the enhanced signal. This demonstrates that determining the value of $T_l$ precisely is a real problem.

Fig. 4. Reverberant (a) and enhanced (b) speech signals

Fig. 5. Spectrograms of reverberant (a) and enhanced (b) signals

B. Quantitative evaluation of dereverberation performance

Quantitative evaluation of dereverberation performance was carried out by means of objective measures of ASR accuracy and speech quality. ASR accuracy was assessed using the indicator

$$ \mathrm{Acc\%} = \frac{N - D - S - I}{N} \times 100\%, $$

where $N$ is the total number of labels in the reference transcriptions; $D$ is the number of deletion errors; $S$ is the number of substitution errors; $I$ is the number of insertion errors. The PESQ indicator was used for speech quality assessment [8]. It is interesting that this indicator is sometimes also used for speech intelligibility assessment [10].

The HTK toolkit [9] was used for ASR system simulation. The ASR system was trained on 269 samples of 27 words recorded by two female speakers. A sound file of discrete speech (with pauses of 0.2 to 0.5 s) was used as the test signal; all 27 words were used in training. The phoneme vocabulary contained 27 phonemes of the Ukrainian language, and 39 MFCC_0_D_A coefficients were used in the ASR simulation.

Table I contains the Acc% and PESQ results for clean and reverberant signals. Signals distorted by reverberation were simulated as the convolution of the clean speech signal with room AIRs. Three rooms with reverberation times of 0.74, 0.89 and 1.10 s were used. Sounds of a bursting rubber ball were used as the AIRs of these rooms. T20 is the reverberation time, in seconds, based on a 20 dB evaluation range [11].

As can be seen from Table I, reverberation significantly affects both Acc% (reduced from 93% to between 22 and 30%) and PESQ (reduced from 4.5 to between 2.03 and 2.28).

TABLE I. Acc% AND PESQ FOR CLEAN AND REVERBERANT SIGNALS

| Signal kind | T20, s | Acc% | PESQ |
|---|---|---|---|
| Clean | 0 | 92.59 | 4.5 |
| Reverberant | 0.74 | 22.22 | 2.281 |
| Reverberant | 0.89 | 22.22 | 2.073 |
| Reverberant | 1.10 | 29.63 | 2.030 |

The Acc% and PESQ results for enhanced speech signals are shown in Table II and Figs. 6-7. As can be seen, enhancement by method 1 (the "classic" logMMSE method intended for noise suppression) did not lead to positive results. Meanwhile, enhancement by method 2 (the proposed method) made it possible to increase Acc% significantly (from between 22 and 30% to between 56 and 74%). It is interesting that the PESQ value did not rise as much (from 2.281 to 2.33 for T20 = 0.74 s, and only from 2.073 to 2.08 for T20 = 0.89 s).

TABLE II. Acc% AND PESQ FOR ENHANCED SIGNALS

| T20, s | Acc%, enhanced by method 1 | Acc%, enhanced by method 2 | PESQ, enhanced by method 1 | PESQ, enhanced by method 2 |
|---|---|---|---|---|
| 0.74 | 18.52 | 74.1 | 2.252 | 2.33 |
| 0.89 | 14.81 | 55.6 | 2.059 | 2.08 |
| 1.10 | 29.63 | 62.3 | 2.037 | 2.23 |

Fig. 6. Recognition accuracy with and without speech enhancement

Fig. 7. Speech quality in the absence and presence of enhancement

The results of experimental studies of the dependencies Acc%($T_l$) and PESQ($T_l$) are shown in Table III and Figs. 8-9. It follows from these results that the value of $T_l$ that is optimal in the sense of maximum Acc% lies in the interval from 100 to 200 ms. The situation with the PESQ($T_l$) dependency is less certain. In two of the three cases speech quality decreases with increasing $T_l$, and only in one case was a weakly pronounced maximum observed at $T_l \approx 200$ to $240$ ms.

TABLE III. Acc% AND PESQ FOR DIFFERENT $T_l$

| T20, s | $T_l$, ms | Acc% | PESQ |
|---|---|---|---|
| 0.74 | 48 | 66.7 | 2.33 |
| 0.74 | 96 | 74.1 | 2.30 |
| 0.74 | 144 | 70.4 | 2.27 |
| 0.74 | 192 | 59.3 | 2.26 |
| 0.74 | 240 | 48.2 | 2.27 |
| 0.74 | 288 | 44.4 | 2.26 |
| 0.89 | 48 | 51.9 | 2.00 |
| 0.89 | 96 | 51.9 | 2.06 |
| 0.89 | 144 | 51.9 | 2.05 |
| 0.89 | 192 | 55.6 | 2.08 |
| 0.89 | 240 | 44.4 | 2.08 |
| 0.89 | 288 | 33.3 | 2.07 |
| 1.10 | 48 | 62.3 | 2.23 |
| 1.10 | 96 | 62.3 | 2.19 |
| 1.10 | 144 | 55.6 | 2.16 |
| 1.10 | 192 | 51.9 | 2.07 |
| 1.10 | 240 | 48.2 | 2.03 |
| 1.10 | 288 | 44.4 | 2.01 |

Fig. 8. Acc%($T_l$) dependency

Fig. 9. PESQ($T_l$) dependency
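The accuracy indicator and the search for the best boundary in Table III amount to simple arithmetic, which the sketch below makes explicit. The function names `accuracy_percent` and `best_boundaries` are hypothetical, the table values are transcribed from Table III, and the error counts in the last lines are made up for illustration, since the paper does not report the individual numbers of deletions, substitutions and insertions.

```python
def accuracy_percent(n_labels, deletions, substitutions, insertions):
    """Acc% = (N - D - S - I) / N * 100, the ASR accuracy indicator."""
    errors = deletions + substitutions + insertions
    return 100.0 * (n_labels - errors) / n_labels


# Table III, transcribed: T20 (s) -> list of (T_l in ms, Acc%, PESQ).
TABLE_III = {
    0.74: [(48, 66.7, 2.33), (96, 74.1, 2.30), (144, 70.4, 2.27),
           (192, 59.3, 2.26), (240, 48.2, 2.27), (288, 44.4, 2.26)],
    0.89: [(48, 51.9, 2.00), (96, 51.9, 2.06), (144, 51.9, 2.05),
           (192, 55.6, 2.08), (240, 44.4, 2.08), (288, 33.3, 2.07)],
    1.10: [(48, 62.3, 2.23), (96, 62.3, 2.19), (144, 55.6, 2.16),
           (192, 51.9, 2.07), (240, 48.2, 2.03), (288, 44.4, 2.01)],
}

ACC_COLUMN = 1
PESQ_COLUMN = 2


def best_boundaries(rows, column):
    """Return every T_l (ms) at which the chosen column reaches its maximum."""
    best = max(row[column] for row in rows)
    return [row[0] for row in rows if row[column] == best]


for t20, rows in TABLE_III.items():
    print(f"T20 = {t20:.2f} s: "
          f"Acc% is highest at {best_boundaries(rows, ACC_COLUMN)} ms, "
          f"PESQ is highest at {best_boundaries(rows, PESQ_COLUMN)} ms")

# Hypothetical error counts, not reported in the paper:
# 27 reference labels with two deletions and no other errors.
example_acc = accuracy_percent(27, deletions=2, substitutions=0, insertions=0)
print(f"example Acc% = {example_acc:.2f}")
```

Ties are reported as lists on purpose: in two of the rooms the maximum is shared by neighbouring values of $T_l$, which is one reason the optimum is stated as an interval rather than a single value.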
V. DISCUSSION

As can be seen from the experimental results (for reverberation times of 0.7 to 1.1 s, which are typical for laboratories, classrooms and lecture halls), reverberation can significantly reduce ASR accuracy and speech quality. In particular, Acc% decreased from 93% to between 22 and 30%, and PESQ decreased from 4.5 to between 2.0 and 2.3.

Direct application of the logMMSE method to reverberant speech signals does not improve the accuracy or the quality of speech even to a small degree. The proposed modification of the logMMSE method improved Acc% significantly, from between 22 and 30% to between 56 and 74%, and PESQ values also increased, on average over the three rooms from 2.13 to 2.21.

The results obtained are preliminary, because the training and test samples were small and only the logMMSE method, out of a wide set of speech enhancement methods, was used. It is natural to expect that similar conclusions will hold for other methods, such as spectral subtraction and MMSE [1].

In this paper the reverberation time was estimated from the available AIRs. In many cases, blind reverberation time estimation is necessary. It is natural to expect that blind estimation of the reverberation time will lead to some deterioration of signal quality and ASR accuracy. Assessing the extent of this deterioration can be the subject of future work.

VI. CONCLUSIONS

Experimental studies of the dependencies Acc%($T_l$) and PESQ($T_l$) were conducted. It was shown that the value of $T_l$ that is optimal in the sense of maximum Acc% lies in the interval from 100 to 200 ms. The situation with the PESQ($T_l$) dependency is less certain: in two of the three cases speech quality decreased with increasing $T_l$, and only in one case was a weakly pronounced maximum observed at $T_l \approx 200$ to $240$ ms.

The performance of speech recognition and communication systems using a relatively simple speech dereverberation pre-processor, proposed by the authors, was evaluated. The proposed method modifies the existing logMMSE method by using a late reverberation spectrum estimator instead of a noise spectrum estimator. The validity of the proposal was verified experimentally: on average over the three rooms, Acc% rose from 25% to 64%, and PESQ also increased, though much less.

REFERENCES

[1] I. Cohen, J. Benesty, and S. Gannot (Eds.), Speech Processing in Modern Communication: Challenges and Perspectives. Jan. 2010, 342 p.
[2] P. Naylor and N. Gaubitch, Speech Dereverberation. Springer, 2010, 399 p.
[3] E.A.P. Habets, Single- and Multi-Microphone Speech Dereverberation using Spectral Enhancement, PhD dissertation, Eindhoven, 2007, 257 p.
[4] T. Yoshioka et al., "Making Machines Understand Us in Reverberant Rooms," IEEE Signal Processing Magazine, pp. 114-126, Nov. 2012.
[5] J.S. Bradley, "The Evolution of Newer Auditorium Acoustics Measures," Canadian Acoustics, 18(4), pp. 13-23, 1990.
[6] Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-33, no. 2, pp. 443-445, Apr. 1985.
[7] VOICEBOX: Speech Processing Toolbox for MATLAB [Online]. Available: http://www.ee.ic.ac.uk/hp/staff/dmb/
[8] P. Loizou, Speech Enhancement: Theory and Practice. Boca Raton: CRC Press, 2007, 632 p.
[9] S. Young et al., The HTK Book. Cambridge University Engineering Department, 2005, 354 p. [Online]. Available: http://htk.eng.cam.ac.uk/download.shtml
[10] J. Beerends, E. Larsen, N. Iyer, and J. van Vugt, "Measurement of speech intelligibility based on the PESQ approach," in Proc. Int. Conf. Meas. Speech Audio Quality Netw., 2004, 4 p.
[11] ISO 3382-1:2009. Acoustics. Measurement of room acoustic parameters. Part 1. Performance spaces. ISO, 2009, 26 p.

978-1-4799-4580-1/14/$31.00 ©2014 IEEE