
Human and Automatic Speaker Recognition over Telecommunication Channels

178 Pages·2016·3.955 MB·English


T-Labs Series in Telecommunication Services

Laura Fernández Gallardo
Human and Automatic Speaker Recognition over Telecommunication Channels

Series editors
Sebastian Möller, Berlin, Germany
Axel Küpper, Berlin, Germany
Alexander Raake, Berlin, Germany

More information about this series at http://www.springer.com/series/10013

Laura Fernández Gallardo
University of Canberra
Canberra, ACT
Australia

ISSN 2192-2810
ISSN 2192-2829 (electronic)
T-Labs Series in Telecommunication Services
ISBN 978-981-287-726-0
ISBN 978-981-287-727-7 (eBook)
DOI 10.1007/978-981-287-727-7
Library of Congress Control Number: 2015946762
Springer Singapore Heidelberg New York Dordrecht London
© Springer Science+Business Media Singapore 2016

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
Printed on acid-free paper
Springer Science+Business Media Singapore Pte Ltd. is part of Springer Science+Business Media (www.springer.com)

Preface

The automatic detection of people's identity from their voices is part of modern telecommunication services. This generally requires the telephone transmission of speech to remote servers that perform the recognition task. The transmission may introduce severe distortions that degrade the system performance and hence represents one of the major challenges speech technologies are currently facing. Similarly, humans also cope with the difficulty of reliably identifying talkers from speech transmitted over communication channels, particularly if the utterance heard is of short duration. This book addresses the evaluation of the human and of the automatic speaker recognition performances under different channel distortions caused by bandwidth limitation, codecs, and electroacoustic user interfaces, among other impairments.

The main contribution of this work is the demonstration of the benefits of communication channels of extended bandwidth, together with an insight into how speaker-specific characteristics of speech are preserved through different transmissions. This book intends to assist students, researchers, and engineers to assess the speaker recognition performance employing transmitted speech. Particularly interesting for network engineers, this work provides sufficient motivation for considering speaker recognition as a criterion for the migration from narrowband to enhanced bandwidths, such as wideband and super-wideband.

This book was written in the context of my Ph.D. project, a research agreement between the Telekom Innovation Laboratories and the Technische Universität Berlin (TU Berlin), Germany, and the University of Canberra (UC), Australia. It therefore involved periods of research in Berlin, at the Quality and Usability Lab of the TU Berlin, and in Canberra, at the Human-Centred Computing Laboratory of the UC.
This work would not have been possible without the contributions of many supporters. I am most thankful to my two main supervisors Prof. Michael Wagner and Prof. Sebastian Möller for their support, direction, and advice throughout the course of my Ph.D. Their insightful comments and feedback have repeatedly guided my research focus towards its final form. My gratitude goes also to my third supervisor Associate Professor Roland Göcke, for his constructive comments and input on my work. I wish to express my appreciation to the University of Canberra for the research grant award and to Deutsche Telekom AG for full project funding.

I would like to thank, among my colleagues in Berlin, the groups of speakers and listeners who volunteered to be recorded and to participate in my auditory tests. This was a kind favour as I required test participants who knew each other and had been exposed for a long period to one another's voices. Special thanks to Marcel Wältermann and to Janto Skowronek for their assistance with the audio transmission and reception scenarios of the auditory tests I conducted. I would also like to acknowledge Friedemann Köster's help in estimating quality of speech signals employing instrumental speech quality measures.

After having acquired a background on human voice perception and speaker recognisability, I was still new to automatic speaker recognition procedures. I would like to thank my colleague David Vandyke for the quick introduction to speaker recognition methodology when I arrived in Australia, jointly with Prof. Michael Wagner, which saved me considerable time.

Last, but not least, thanks to my family for their support, understanding, and patience while I was 17,800 km away from home (the current 2,300 km seems to be a bearable distance), and thanks to my friends, for their always-comforting words. Among other incentives, this encouraged my continuous work and dedication towards the completion of this book.

Berlin, June 2015
Laura Fernández Gallardo

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Scope of This Book and Contribution
  1.3 Outline of This Book
2 Literature Review
  2.1 Today's Communication Channels and Their Main Impairments
  2.2 Channel Quality Evaluation
    2.2.1 Subjective Speech Quality Assessment
    2.2.2 Instrumental Speech Quality Measures
    2.2.3 Relations Between Quality and Other Attributes of the Speech Signal
  2.3 Human Speaker Recognition
    2.3.1 Speech Characteristics Enabling Human Speaker Recognition
    2.3.2 Effects of Communication Channels on Human Speaker Recognition
    2.3.3 Literature on Human Speaker Recognition and This Book
  2.4 Automatic Speaker Recognition
    2.4.1 Automatic Speaker Recognition Principles and Main Systems
    2.4.2 Effects of Phonetic Content on Automatic Speaker Recognition
    2.4.3 Effects of Communication Channels on Automatic Speaker Recognition
    2.4.4 NIST Speaker Recognition Evaluations
    2.4.5 Comparison Between the Human and the Automatic Speaker Recognition Performance
    2.4.6 Literature on Automatic Speaker Recognition and This Book
3 Human Speaker Identification Performance Under Channel Degradations
  3.1 Experimental Environment of the Listening Tests
    3.1.1 Database Collection for the Listening Tests
    3.1.2 Listening Test 1
    3.1.3 Listening Test 2
  3.2 Effects of Codec and Bandwidth
  3.3 Effects of Electro-Acoustic User Interface
  3.4 Effects of Random Packet Loss
  3.5 Target Speaker and Familiarity
4 Importance of Intelligible Phonemes for Human Speaker Recognition in Different Bandwidths
  4.1 Human Speaker Recognition from Logatomes
    4.1.1 Audio Preparation and Listening Test
    4.1.2 Accuracies per Logatome in Different Bandwidths
  4.2 Human Speech Intelligibility from Logatomes
  4.3 Relation Between Speaker Recognition and Intelligibility in Narrowband and in Wideband
5 Automatic Speaker Verification Performance Under Channel Distortions
  5.1 Datasets and Speech Transmissions
  5.2 Effects of Channel Impairments in Matched Conditions
    5.2.1 GMM-UBM Performance Under Channel Distortions
    5.2.2 JFA Performance Under Bandwidth and Codec Effects
    5.2.3 I-Vectors Performance Under Bandwidth and Codec Effects
  5.3 Training and Testing Approaches to Reduce Possible Mismatch Effects
    5.3.1 JFA Under Channel Mismatch
    5.3.2 I-Vectors Under Channel Mismatch
6 Detecting Speaker-Discriminative Spectral Content in Wideband for Automatic Speaker Recognition
  6.1 Effects of the Transmission Channel on the Distribution of Speaker-Discriminative Spectral Content
    6.1.1 Audio Material
    6.1.2 Spectral Sub-band Analysis
    6.1.3 Results and Discussion
    6.1.4 Sub-band Score-Level Fusion
  6.2 Different Cepstral Features for Narrowband and for Wideband Speech
    6.2.1 Speech Filtering and Feature Extraction
    6.2.2 I-Vector Experiments
    6.2.3 Results and Discussion
    6.2.4 Score-Level Fusion of Two Frequency Ranges
  6.3 Relevance of Phonetic Information Under Transmission Channel Effects
    6.3.1 Phoneme Filtering
    6.3.2 I-Vector Experiments
    6.3.3 Results and Discussion
7 Relations Among Speech Quality, Human Speaker Identification, and Automatic Speaker Verification
  7.1 Quality and Performance Metrics for Different Channel Degradations
    7.1.1 Instrumental Quality Measurements
    7.1.2 Quality and Speech and Speaker Recognition Performance Metrics
  7.2 Predicting Human Speaker Identification Performance from Measured Speech Quality
    7.2.1 Model Fit with POLQA MOS as Estimator
    7.2.2 Model Fit with DIAL Coloration as Estimator
    7.2.3 Estimations of Human Speaker Identification Performance
  7.3 Predicting Automatic Speaker Verification Performance from Measured Speech Quality
    7.3.1 New Instrumental Quality Measurements
    7.3.2 Model Fit with POLQA MOS as Estimator
    7.3.3 Model Fit with DIAL Coloration as Estimator
    7.3.4 Estimations of Automatic Speaker Verification Performance
  7.4 Predicting Human Speaker Identification Performance from Computed Speaker Verification EERs
    7.4.1 Model Fit with EERs as Estimators
    7.4.2 Estimations of Human Speaker Identification Performance
