Prediction and Optimization of Speech Intelligibility in Adverse Conditions

Dissertation for the purpose of obtaining the degree of doctor at the Technische Universiteit Delft, by the authority of the Rector Magnificus Prof. ir. K. C. A. M. Luyben, chairman of the Board for Doctorates, to be defended publicly on Friday 25 January 2013 at 10:00 by Cornelis (Cees) Harm TAAL, Ingenieur in Media & Kennis Technologie, born in Hoogeveen.

This dissertation has been approved by the
promotor: Prof. dr. ir. R. L. Lagendijk
copromotor: Dr. ir. R. Heusdens

Composition of the doctoral committee:
Rector Magnificus, chairman
Prof. dr. ir. R. L. Lagendijk, Technische Universiteit Delft, promotor
Dr. ir. R. Heusdens, Technische Universiteit Delft, copromotor
Prof. dr. ir. W. A. Dreschler, Academisch Medisch Centrum, Amsterdam
Prof. dr. ir. G. J. T. Leus, Technische Universiteit Delft
Prof. dr. P. A. Naylor, Imperial College, London, United Kingdom
Prof. dr. ir. L. J. van Vliet, Technische Universiteit Delft
Prof. dr. J. Wouters, Katholieke Universiteit Leuven, Belgium

Dr. ir. R. C. Hendriks has, as supervisor, contributed significantly to the realization of this dissertation.

The work described in this thesis was financially supported by STW and Oticon A/S.

Copyright © 2013 by C. H. Taal

All rights reserved. No part of this thesis may be reproduced or transmitted in any form or by any means, electronic, mechanical, photocopying, any information storage or retrieval system, or otherwise, without written permission from the copyright owner.

Summary

In digital speech-communication systems such as mobile phones, public address systems and hearing aids, conveying the message is one of the most important goals. This can be challenging, since the intelligibility of the speech may be harmed at various stages before, during and after the transmission process from sender to receiver.
Causes of such adverse conditions include background noise, an unreliable internet connection during a Skype conversation, or a hearing impairment of the receiver. To overcome this, many speech-communication systems include speech processing algorithms, such as noise reduction, that compensate for these signal degradations. To determine the effect of these signal-processing-based solutions on speech intelligibility, the speech signal has to be evaluated by means of a listening test with human listeners. However, such tests are costly and time-consuming. As an alternative, reliable and fast machine-driven intelligibility predictors are of interest, since they might replace listening tests, at least in some stages of the algorithm development process.

Two important issues exist with current intelligibility predictors. (1) Many of these methods cannot reliably predict the effect of more advanced nonlinear signal processing algorithms on speech intelligibility. (2) Typically, these measures are based on very complex auditory models or use average statistics of minutes of running speech, which makes it difficult to design new (real-time) speech processing solutions that are optimal with respect to such a measure. To this end we propose several new measures whose predictions agree well with the intelligibility of nonlinearly processed speech. The newly proposed measures have low computational complexity and are mathematically tractable, which makes them suitable for optimizing new signal processing solutions that aim to improve speech intelligibility.

An important stage in many speech intelligibility predictors is the use of an auditory model. In the first part of this thesis we show that a general, sophisticated auditory model can be greatly simplified while preserving accurate predictions of psycho-acoustic listening experiments.
The resulting simplified model facilitates the computation of analytic expressions for masking thresholds, whereas advanced state-of-the-art models typically need computationally demanding adaptive procedures. Its mathematical properties are successfully exploited by optimally redistributing speech energy such that speech intelligibility is improved when the speech is played back in a noisy environment, without modifying the signal-to-noise ratio.

In the design process of new intelligibility predictors we first analyse the strengths and weaknesses of existing measures. In total, 17 different measures are evaluated for intelligibility prediction of time-frequency weighted noisy speech. We show that, despite high correlation with the listening test scores, several measures cannot predict the difference in intelligibility before and after signal processing. We explain that a state-of-the-art measure was not able to predict the intelligibility due to its sensitivity to the DFT phase components. Issues with existing measures for intelligibility prediction are highlighted, and a general normalization procedure is proposed as a pre-processing step which improves their correlation with speech intelligibility.

We propose a new short-time objective intelligibility measure (STOI) which shows high correlation with the intelligibility of time-frequency weighted noisy speech, including noise-reduced and vocoded speech. In general, STOI correlates better with speech intelligibility than five other state-of-the-art objective intelligibility models. One important difference between STOI and other measures is its analysis length, which is on the order of a few hundred milliseconds rather than complete sentences or frames of 20-30 ms.

Due to the simple structure of STOI, we show in the final part of this thesis that the measure can be interpreted as a mathematical norm, which is applied to channel selection in cochlear-implant simulations.
Several intelligibility predictors indicate large intelligibility improvements with the new method based on STOI compared to a peak-picking algorithm.

Contents

Summary

1 Introduction
  1.1 Thesis Scope and Contributions
  1.2 Thesis outline
  1.3 List of Publications

2 Background
  2.1 Introduction
  2.2 Notation
  2.3 Dau-model
    2.3.1 Internal Representation
    2.3.2 Defining a Perceptual Difference
  2.4 Par-model
    2.4.1 Internal Representations
    2.4.2 Defining a Perceptual Difference
    2.4.3 Mathematical Properties
  2.5 Coherence Speech Intelligibility Index
    2.5.1 SII
    2.5.2 SII based on Coherence
  2.6 Relation to Thesis Chapters

3 A Low-complexity Spectro-Temporal Distortion Measure
  3.1 Introduction
  3.2 Preliminaries
  3.3 Proposed Spectro-Temporal Distortion Measure
    3.3.1 Auditory Model
    3.3.2 Perceptual Distance between Internal Representations
    3.3.3 Low-complexity Approximation
    3.3.4 Implementation Details
  3.4 Masking
    3.4.1 Masking Threshold
    3.4.2 Masking Curve
  3.5 Model Evaluation and Comparison
    3.5.1 Reference models
    3.5.2 Prediction of Masking Curves
    3.5.3 Complexity
  3.6 Experimental Results
    3.6.1 Example
    3.6.2 Listening Test
  3.7 Relation Between Proposed Model and the Par-model
  3.8 Conclusions
  3.A Derivation of spectro-temporal gain function g_i

4 An Evaluation of Intelligibility Predictors
  4.1 Introduction
  4.2 Intelligibility Data
    4.2.1 Signal Processing
    4.2.2 Test Material
    4.2.3 Listening Experiment
  4.3 Objective Measures
    4.3.1 Preliminaries
    4.3.2 Intelligibility Measures
    4.3.3 Speech Quality Measures
    4.3.4 Proposed Measures MCC and LCC
  4.4 A Critical-Band Based Normalization Procedure
  4.5 Evaluation Procedure
  4.6 Results and Discussion
    4.6.1 Detailed Evaluation of Intelligibility Measures
    4.6.2 Detailed Evaluation of Speech Quality Measures
    4.6.3 Influence of Critical-Band Based Normalization Procedure
  4.7 Generality of Results
    4.7.1 Single-Channel Noise-Reduced Speech
    4.7.2 Other Types of Signal Degradations
  4.8 Conclusions

5 An Intelligibility Predictor for TF-Weighted Noisy Speech
  5.1 Introduction
    5.1.1 Rationale of Proposed Intelligibility Measure
    5.1.2 Further Outline
  5.2 STOI
    5.2.1 Example of Normalization and Clipping Procedure
  5.3 Listening Experiments
    5.3.1 Ideal Time Frequency Segregation
    5.3.2 Single-Channel Noise Reduction
    5.3.3 ITFS with artificially introduced errors
  5.4 Evaluation procedure
    5.4.1 Mapping
    5.4.2 Reference Objective Measures
  5.5 Results
    5.5.1 Correlation between STOI and Intelligibility Scores
    5.5.2 Analysis of Absolute Intelligibility Predictions
    5.5.3 Effect of parameters N and β
    5.5.4 Comparison with Other Intelligibility Models
  5.6 Discussion
  5.7 Conclusions

6 Energy Redistribution for Intelligibility Improvement
  6.1 Introduction
  6.2 Proposed Speech Pre-Processing Algorithm
    6.2.1 Perceptual Distortion Measure
    6.2.2 Power-Constrained Speech-Audibility Optimization
    6.2.3 Implementation Details
    6.2.4 Properties and Examples
  6.3 Experimental Evaluation
    6.3.1 Speech Intelligibility
    6.3.2 Speech Quality
  6.4 Discussion
    6.4.1 Speech Quality versus Intelligibility
    6.4.2 Predicted Intelligibility versus Algorithmic Delay
    6.4.3 Algorithm Performance in Far-End Noisy Conditions
  6.5 Conclusions

7 STOI-based Matching Pursuit for CI channel selection
  7.1 Introduction
  7.2 Derivation of Intelligibility Metric
    7.2.1 STOI Background and Simplification
    7.2.2 Interpretation as weighted ℓ2 norm
  7.3 Application to CI channel selection
    7.3.1 Intelligibility Relevant Matching Pursuit
  7.4 Vocoder Details
  7.5 Experimental Results
  7.6 Concluding Remarks

8 Discussion and Conclusions
  8.1 Results
  8.2 Directions of Future Research

References
Samenvatting (summary in Dutch)
Acknowledgements
Curriculum Vitae