Deep Learning Based Speech Quality Prediction PDF

171 Pages·2022·7.372 MB·English

Preview Deep Learning Based Speech Quality Prediction

T-Labs Series in Telecommunication Services

Gabriel Mittag
Deep Learning Based Speech Quality Prediction

Series Editors
Sebastian Möller, Quality and Usability Lab, Technische Universität Berlin, Berlin, Germany
Axel Küpper, Telekom Innovation Laboratories, Technische Universität Berlin, Berlin, Germany
Alexander Raake, Audiovisual Technology Group, Technische Universität Ilmenau, Ilmenau, Germany

It is the aim of the Springer Series in Telecommunication Services to foster an interdisciplinary exchange of knowledge addressing all topics which are essential for developing high-quality and highly usable telecommunication services. This includes basic concepts of underlying technologies, distribution networks, architectures and platforms for service design, deployment and adaptation, as well as the users' perception of telecommunication services. By taking a vertical perspective over all these steps, we aim to provide the scientific bases for the development and continuous evaluation of innovative services which provide a better value for their users. In fact, the human-centric design of high-quality telecommunication services, the so-called "quality engineering", forms an essential topic of this series, as it will ultimately lead to better user experience and acceptance. The series is directed towards both scientists and practitioners from all related disciplines and industries.
**Indexing: books in this series are indexed in Scopus**

More information about this series at https://link.springer.com/bookseries/10013

Gabriel Mittag
Technische Universität Berlin
Berlin, Germany

ISSN 2192-2810 ISSN 2192-2829 (electronic)
T-Labs Series in Telecommunication Services
ISBN 978-3-030-91478-3 ISBN 978-3-030-91479-0 (eBook)
https://doi.org/10.1007/978-3-030-91479-0

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Instrumental speech quality prediction is a long-studied field in which many models have been presented. However, in particular, the single-ended prediction without the use of a clean reference signal remains challenging.
This book studies how recent developments in machine learning can be leveraged to improve the quality prediction of transmitted speech and additionally provide diagnostic information through the prediction of speech quality dimensions. In particular, different deep learning architectures were analyzed towards their suitability to predict speech quality. To this end, a large dataset with distorted speech files and crowdsourced subjective ratings was created. A number of deep learning architectures, such as CNNs, LSTM networks, and Transformer/self-attention networks, were combined and compared. It was found that a network with CNN, self-attention, and a proposed attention-pooling delivers the best single-ended speech quality predictions on the considered dataset. Furthermore, a double-ended speech quality prediction model based on a Siamese neural network is presented. However, it could be shown that, in contrast to traditional models, deep learning models only slightly benefit from including the clean reference signal. For the prediction of perceptual speech quality dimensions, a multi-task learning based model is presented that predicts the overall speech quality and the quality dimensions noisiness, coloration, discontinuity, and loudness in parallel, where most of the neural network layers are shared between the individual tasks. Finally, the single-ended speech quality prediction model NISQA is presented that was trained on a large variety of 59 different datasets. Because the training datasets come from a variety of sources and contain different quality ranges, they are exposed to subjective biases. Therefore, the same speech distortions can lead to very different quality ratings in two datasets. To prevent a negative influence of this effect, a bias-aware loss function is proposed that estimates and considers the biases during the training of the neural network weights. The final model was tested on a live-talking test set with real recorded phone calls, on which it achieved a Pearson's correlation of 0.90 for the overall speech quality prediction.
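The Pearson's correlation quoted above measures how linearly a model's predicted quality scores follow the subjective ratings. As a minimal sketch of that metric only (not the book's evaluation code), the following computes it for two lists of scores. The function name `pearson_correlation` and the example scores are hypothetical and chosen purely for illustration.

```python
import math


def pearson_correlation(predicted, subjective):
    """Return Pearson's r between two equal-length sequences of scores."""
    if len(predicted) != len(subjective) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_s = sum(subjective) / n
    # Sum of cross-deviations and sums of squared deviations
    cov = sum((p - mean_p) * (s - mean_s) for p, s in zip(predicted, subjective))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_s = sum((s - mean_s) ** 2 for s in subjective)
    if var_p == 0 or var_s == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_s)


# Hypothetical scores on a 1-5 quality scale (illustration only, not book data)
predicted_mos = [1.8, 2.9, 3.1, 4.2, 4.6]
subjective_mos = [1.5, 3.0, 3.4, 4.0, 4.8]
r = pearson_correlation(predicted_mos, subjective_mos)
print(f"Pearson's r = {r:.2f}")
```

A value close to 1 means the predictions rank and scale the files much like the listeners did. The metric ignores constant offsets between the two sets of scores, so it says nothing about systematic over- or under-prediction.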
Berlin, Germany
Gabriel Mittag

Acknowledgments

I am very grateful to the many supporters who have made this work possible. During the last years, I had the pleasure to meet and get to know many interesting people at the Quality and Usability Lab, but also at several academic conferences, workshops, and ITU meetings. First, I would like to thank my thesis supervisor Prof. Dr. Sebastian Möller for his support, his scientific expertise, and his advice that greatly helped me to write and complete this book. My special thanks also go to Dr. Friedemann Köster, who introduced me to the exciting field of speech quality estimation and without whom I probably would not have started my doctoral studies. I would like to thank my student assistant Louis Liedtke for his support and also all the students I had the pleasure to supervise during their bachelor's or master's theses preparation, in particular Assmaa Chehadi for her work on one of the datasets used in this book and Huahua Maier for his work on the Android recording app. I also want to thank Prof. Dr. Gerhard Schmidt and Tobias Hübschen from the University of Kiel and Dr. Jens Berger for the great collaboration during the DFG project. I would like to thank Prof. Tiago H. Falk and, a second time, Prof. Dr. Gerhard Schmidt for reviewing this book and for serving on my doctoral committee. Many thanks go to Irene Hube-Achter, Yasmin Hillebrenner, and Tobias Jettkowski for their excellent administrative and technical support. Thanks to all my former and current colleagues at the Quality and Usability Lab for the numerous discussions, exchange of research ideas, and for keeping me company during my coffee breaks and making sure that it never got too boring at the lab, including Steven Schmidt, Saman Zadtootaghaj, Sai Sirisha Rallabandi, Thilo Michael, Tanja Kojic, Dr. Babak Naderi, Dr. Laura Fernández Gallardo, Dr. Patrick Ehrenbrink, Dr. Dennis Guse, Dr. Maija Poikela, Dr. Falk Schiffner, and Dr. Stefan Uhrig, and many more. Thank you all for a great time!

Contents

1 Introduction
  1.1 Motivation
  1.2 Thesis Objectives and Research Questions
  1.3 Outline
2 Quality Assessment of Transmitted Speech
  2.1 Speech Communication Networks
  2.2 Speech Quality and Speech Quality Dimensions
  2.3 Subjective Assessment
  2.4 Subjective Assessment via Crowdsourcing
  2.5 Traditional Instrumental Methods
    2.5.1 Parametric Models
    2.5.2 Double-Ended Signal-Based Models
    2.5.3 Single-Ended Signal-Based Models
  2.6 Machine Learning Based Instrumental Methods
    2.6.1 Non-Deep Learning Machine Learning Approaches
    2.6.2 Deep Learning Architectures
    2.6.3 Deep Learning Based Speech Quality Models
  2.7 Summary
3 Neural Network Architectures for Speech Quality Prediction
  3.1 Dataset
    3.1.1 Source Files
    3.1.2 Simulated Distortions
    3.1.3 Live Distortions
    3.1.4 Listening Experiment
  3.2 Overview of Neural Network Model
  3.3 Mel-Spec Segmentation
  3.4 Framewise Model
    3.4.1 CNN
    3.4.2 Feedforward Network
  3.5 Time-Dependency Modelling
    3.5.1 LSTM
    3.5.2 Transformer/Self-Attention
  3.6 Time Pooling
    3.6.1 Average-/Max-Pooling
    3.6.2 Last-Step-Pooling
    3.6.3 Attention-Pooling
  3.7 Experiments and Results
    3.7.1 Training and Evaluation Metric
    3.7.2 Framewise Model
    3.7.3 Time-Dependency Model
    3.7.4 Pooling Model
  3.8 Summary
4 Double-Ended Speech Quality Prediction Using Siamese Networks
  4.1 Introduction
  4.2 Method
    4.2.1 Siamese Neural Network
    4.2.2 Reference Alignment
    4.2.3 Feature Fusion
  4.3 Results
    4.3.1 LSTM vs Self-Attention
    4.3.2 Alignment
    4.3.3 Feature Fusion
    4.3.4 Double-Ended vs Single-Ended
  4.4 Summary
5 Prediction of Speech Quality Dimensions with Multi-Task Learning
  5.1 Introduction
  5.2 Multi-Task Models
    5.2.1 Fully Connected (MTL-FC)
    5.2.2 Fully Connected + Pooling (MTL-POOL)
    5.2.3 Fully Connected + Pooling + Time-Dependency (MTL-TD)
    5.2.4 Fully Connected + Pooling + Time-Dependency + CNN (MTL-CNN)
  5.3 Results
    5.3.1 Per-Task Evaluation
    5.3.2 All-Tasks Evaluation
    5.3.3 Comparing Dimension
    5.3.4 Degradation Decomposition
  5.4 Summary
6 Bias-Aware Loss for Training from Multiple Datasets
  6.1 Method
    6.1.1 Learning with Bias-Aware Loss
    6.1.2 Anchoring Predictions
  6.2 Experiments and Results
    6.2.1 Synthetic Data
    6.2.2 Minimum Accuracy r_th
    6.2.3 Training Examples with and Without Anchoring
    6.2.4 Configuration Comparisons
    6.2.5 Speech Quality Dataset
  6.3 Summary
7 NISQA: A Single-Ended Speech Quality Model
  7.1 Datasets
    7.1.1 POLQA Pool
    7.1.2 ITU-T P Suppl. 23
    7.1.3 Other Datasets
    7.1.4 Live-Talking Test Set
  7.2 Model and Training
    7.2.1 Model
    7.2.2 Bias-Aware Loss
    7.2.3 Handling Missing Dimension Ratings
    7.2.4 Training
  7.3 Results
    7.3.1 Evaluation Metrics
    7.3.2 Validation Set Results: Overall Quality
    7.3.3 Validation Set Results: Quality Dimensions
    7.3.4 Test Set Results
    7.3.5 Impairment Level vs Quality Prediction
  7.4 Summary
8 Conclusions
A Dataset Condition Tables
B Train and Validation Dataset Dimension Histograms
References
Index
