Table of Contents

T-Labs Series in Telecommunication Services

Gabriel Mittag
Deep Learning Based Speech Quality Prediction

Series Editors
Sebastian Möller, Quality and Usability Lab, Technische Universität Berlin, Berlin, Germany
Axel Küpper, Telekom Innovation Laboratories, Technische Universität Berlin, Berlin, Germany
Alexander Raake, Audiovisual Technology Group, Technische Universität Ilmenau, Ilmenau, Germany

It is the aim of the Springer Series in Telecommunication Services to foster an interdisciplinary exchange of knowledge addressing all topics which are essential for developing high-quality and highly usable telecommunication services. This includes basic concepts of underlying technologies, distribution networks, architectures and platforms for service design, deployment and adaptation, as well as the users’ perception of telecommunication services. By taking a vertical perspective over all these steps, we aim to provide the scientific bases for the development and continuous evaluation of innovative services which provide a better value for their users. In fact, the human-centric design of high-quality telecommunication services, the so-called “quality engineering”, forms an essential topic of this series, as it will ultimately lead to better user experience and acceptance. The series is directed towards both scientists and practitioners from all related disciplines and industries.
**Indexing: books in this series are indexed in Scopus**

More information about this series at https://link.springer.com/bookseries/10013

Gabriel Mittag
Technische Universität Berlin
Berlin, Germany

ISSN 2192-2810
ISSN 2192-2829 (electronic)
T-Labs Series in Telecommunication Services
ISBN 978-3-030-91478-3
ISBN 978-3-030-91479-0 (eBook)
https://doi.org/10.1007/978-3-030-91479-0

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Instrumental speech quality prediction is a long-studied field in which many models have been presented. However, in particular, the single-ended prediction without the use of a clean reference signal remains challenging.
This book studies how recent developments in machine learning can be leveraged to improve the quality prediction of transmitted speech and additionally provide diagnostic information through the prediction of speech quality dimensions. In particular, different deep learning architectures were analyzed towards their suitability to predict speech quality. To this end, a large dataset with distorted speech files and crowdsourced subjective ratings was created. A number of deep learning architectures, such as CNNs, LSTM networks, and Transformer/self-attention networks, were combined and compared. It was found that a network with CNN, self-attention, and a proposed attention-pooling delivers the best single-ended speech quality predictions on the considered dataset. Furthermore, a double-ended speech quality prediction model based on a Siamese neural network is presented. However, it could be shown that, in contrast to traditional models, deep learning models only slightly benefit from including the clean reference signal. For the prediction of perceptual speech quality dimensions, a multi-task learning based model is presented that predicts the overall speech quality and the quality dimensions noisiness, coloration, discontinuity, and loudness in parallel, where most of the neural network layers are shared between the individual tasks. Finally, the single-ended speech quality prediction model NISQA is presented that was trained on a large variety of 59 different datasets. Because the training datasets come from a variety of sources and contain different quality ranges, they are exposed to subjective biases. Therefore, the same speech distortions can lead to very different quality ratings in two datasets. To prevent a negative influence of this effect, a bias-aware loss function is proposed that estimates and considers the biases during the training of the neural network weights. The final model was tested on a live-talking test set with real recorded phone calls, on which it achieved a Pearson’s correlation of 0.90 for the overall speech quality prediction.
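The idea behind a bias-aware loss, as summarized in the preface, can be illustrated with a small sketch. This is not the book's implementation: it assumes, purely for illustration, that each dataset's bias is a first-order linear mapping from predictions to ratings, estimated by least squares, and that the loss is the mean squared error after applying that mapping. The function names `fit_line`, `bias_aware_mse`, and `plain_mse` are hypothetical, and the toy ratings are invented. The sketch only computes a loss value and leaves out gradient-based training entirely.

```python
from collections import defaultdict
from statistics import fmean


def fit_line(x, y):
    """Least-squares fit y ~ a * x + b.

    Falls back to the identity mapping (a=1, b=0) when x has fewer than
    two points or no variance; this fallback is an assumption of the sketch.
    """
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if len(x) < 2 or sxx == 0.0:
        return 1.0, 0.0
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx


def plain_mse(preds, ratings):
    """Ordinary mean squared error, ignoring which dataset a file belongs to."""
    return fmean((p - r) ** 2 for p, r in zip(preds, ratings))


def bias_aware_mse(preds, ratings, dataset_ids):
    """MSE after mapping predictions through a per-dataset linear bias estimate.

    Assumption: one linear mapping per dataset, refitted from the current
    predictions. The book's actual formulation may differ.
    """
    groups = defaultdict(list)
    for p, r, d in zip(preds, ratings, dataset_ids):
        groups[d].append((p, r))

    squared_error = 0.0
    for pairs in groups.values():
        p = [pair[0] for pair in pairs]
        r = [pair[1] for pair in pairs]
        a, b = fit_line(p, r)
        squared_error += sum((a * pi + b - ri) ** 2 for pi, ri in zip(p, r))
    return squared_error / len(preds)


# Toy example: identical predictions, but dataset "A" was rated 0.5 higher
# and dataset "B" 0.5 lower than the model's scale (invented numbers).
preds = [2.0, 3.0, 4.0, 2.0, 3.0, 4.0]
ratings = [2.5, 3.5, 4.5, 1.5, 2.5, 3.5]
dataset_ids = ["A", "A", "A", "B", "B", "B"]

print("plain MSE:     ", plain_mse(preds, ratings))
print("bias-aware MSE:", bias_aware_mse(preds, ratings, dataset_ids))
```

In this toy case the plain MSE penalizes the model for the constant per-dataset offsets, while the bias-aware variant absorbs them into the estimated mapping and reports an error close to zero, which mirrors the motivation given above that the same distortions can receive very different ratings in two datasets.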
Berlin, Germany
Gabriel Mittag

Acknowledgments

I am very grateful to the many supporters who have made this work possible. During the last years, I had the pleasure to meet and get to know many interesting people at the Quality and Usability Lab, but also at several academic conferences, workshops, and ITU meetings. First, I would like to thank my thesis supervisor Prof. Dr. Sebastian Möller for his support, his scientific expertise, and his advice that greatly helped me to write and complete this book. My special thanks also go to Dr. Friedemann Köster, who introduced me to the exciting field of speech quality estimation and without whom I probably would not have started my doctoral studies. I would like to thank my student assistant Louis Liedtke for his support and also all the students I had the pleasure to supervise during their bachelor’s or master’s theses preparation, in particular Assmaa Chehadi for her work on one of the datasets used in this book and Huahua Maier on his work on the Android recording app. I also want to thank Prof. Dr. Gerhard Schmidt and Tobias Hübschen from the University of Kiel and Dr. Jens Berger for the great collaboration during the DFG project. I would like to thank Prof. Tiago H. Falk and, a second time, Prof. Dr. Gerhard Schmidt for reviewing this book and for serving on my doctoral committee. Many thanks go to Irene Hube-Achter, Yasmin Hillebrenner, and Tobias Jettkowski for their excellent administrative and technical support. Thanks to all my former and current colleagues at the Quality and Usability Lab for the numerous discussions, exchange of research ideas, and for keeping me company during my coffee breaks and making sure that it never got too boring at the lab, including Steven Schmidt, Saman Zadtootaghaj, Sai Sirisha Rallabandi, Thilo Michael, Tanja Kojic, Dr. Babak Naderi, Dr. Laura Fernández Gallardo, Dr. Patrick Ehrenbrink, Dr. Dennis Guse, Dr. Maija Poikela, Dr. Falk Schiffner, and Dr. Stefan Uhrig, and many more. Thank you all for a great time!

Contents

1 Introduction
  1.1 Motivation
  1.2 Thesis Objectives and Research Questions
  1.3 Outline
2 Quality Assessment of Transmitted Speech
  2.1 Speech Communication Networks
  2.2 Speech Quality and Speech Quality Dimensions
  2.3 Subjective Assessment
  2.4 Subjective Assessment via Crowdsourcing
  2.5 Traditional Instrumental Methods
    2.5.1 Parametric Models
    2.5.2 Double-Ended Signal-Based Models
    2.5.3 Single-Ended Signal-Based Models
  2.6 Machine Learning Based Instrumental Methods
    2.6.1 Non-Deep Learning Machine Learning Approaches
    2.6.2 Deep Learning Architectures
    2.6.3 Deep Learning Based Speech Quality Models
  2.7 Summary
3 Neural Network Architectures for Speech Quality Prediction
  3.1 Dataset
    3.1.1 Source Files
    3.1.2 Simulated Distortions
    3.1.3 Live Distortions
    3.1.4 Listening Experiment
  3.2 Overview of Neural Network Model
  3.3 Mel-Spec Segmentation
  3.4 Framewise Model
    3.4.1 CNN
    3.4.2 Feedforward Network
  3.5 Time-Dependency Modelling
    3.5.1 LSTM
    3.5.2 Transformer/Self-Attention
  3.6 Time Pooling
    3.6.1 Average-/Max-Pooling
    3.6.2 Last-Step-Pooling
    3.6.3 Attention-Pooling
  3.7 Experiments and Results
    3.7.1 Training and Evaluation Metric
    3.7.2 Framewise Model
    3.7.3 Time-Dependency Model
    3.7.4 Pooling Model
  3.8 Summary
4 Double-Ended Speech Quality Prediction Using Siamese Networks
  4.1 Introduction
  4.2 Method
    4.2.1 Siamese Neural Network
    4.2.2 Reference Alignment
    4.2.3 Feature Fusion
  4.3 Results
    4.3.1 LSTM vs Self-Attention
    4.3.2 Alignment
    4.3.3 Feature Fusion
    4.3.4 Double-Ended vs Single-Ended
  4.4 Summary
5 Prediction of Speech Quality Dimensions with Multi-Task Learning
  5.1 Introduction
  5.2 Multi-Task Models
    5.2.1 Fully Connected (MTL-FC)
    5.2.2 Fully Connected + Pooling (MTL-POOL)
    5.2.3 Fully Connected + Pooling + Time-Dependency (MTL-TD)
    5.2.4 Fully Connected + Pooling + Time-Dependency + CNN (MTL-CNN)
  5.3 Results
    5.3.1 Per-Task Evaluation
    5.3.2 All-Tasks Evaluation
    5.3.3 Comparing Dimension
    5.3.4 Degradation Decomposition
  5.4 Summary
6 Bias-Aware Loss for Training from Multiple Datasets
  6.1 Method
    6.1.1 Learning with Bias-Aware Loss
    6.1.2 Anchoring Predictions
  6.2 Experiments and Results
    6.2.1 Synthetic Data
    6.2.2 Minimum Accuracy r_th
    6.2.3 Training Examples with and Without Anchoring
    6.2.4 Configuration Comparisons
    6.2.5 Speech Quality Dataset
  6.3 Summary
7 NISQA: A Single-Ended Speech Quality Model
  7.1 Datasets
    7.1.1 POLQA Pool
    7.1.2 ITU-T P Suppl. 23
    7.1.3 Other Datasets
    7.1.4 Live-Talking Test Set
  7.2 Model and Training
    7.2.1 Model
    7.2.2 Bias-Aware Loss
    7.2.3 Handling Missing Dimension Ratings
    7.2.4 Training
  7.3 Results
    7.3.1 Evaluation Metrics
    7.3.2 Validation Set Results: Overall Quality
    7.3.3 Validation Set Results: Quality Dimensions
    7.3.4 Test Set Results
    7.3.5 Impairment Level vs Quality Prediction
  7.4 Summary
8 Conclusions
A Dataset Condition Tables
B Train and Validation Dataset Dimension Histograms
References
Index

Deep Learning Based Speech Quality Prediction PDF

171 Pages·2022·7.372 MB·English

by Gabriel Mittag

#Technique #Communication: Telecommunications


Detailed Information

Author:	Gabriel Mittag
Publication Year:	2022
ISBN:	9783030914783
Pages:	171
Language:	English
File Size:	7.372 MB
Format:	PDF
Price:	FREE

