Altering speech synthesis prosody through real-time natural gestural control

David Abelman

Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2013

Abstract

A significant amount of research has been and continues to be undertaken into generating expressive prosody within speech synthesis. Separately, recent developments in HMM-based synthesis (specifically pHTS, developed at the University of Mons) provide a platform for reactive speech synthesis, able to react in real time to surroundings or user interaction.

Considering both of these elements, this project explores whether it is possible to generate superior prosody in a speech synthesis system, using natural gestural controls, in real time. Building on a previous piece of work undertaken at The University of Edinburgh, a system is constructed in which a user may apply a variety of prosodic effects in real time through natural gestures, recognised by a Microsoft Kinect sensor. Gestures are recognised and prosodic adjustments made through a series of hand-crafted rules (based on data gathered from preliminary experiments), though machine learning techniques are also considered within this project and recommended for future iterations of the work.

Two sets of formal experiments are implemented, both of which suggest that, under further development, the system developed may work successfully in a real-world environment. Firstly, user tests show that subjects can learn to control the device successfully, adding prosodic effects to the intended words in the majority of cases with practice. Results are likely to improve further as buffering issues are resolved. Secondly, listening tests show that the prosodic effects currently implemented significantly increase perceived naturalness, and in some cases are able to alter the semantic perception of a sentence in an intended way.

Alongside this paper, a demonstration video of the project may be found on the accompanying CD, or online at http://tinyurl.com/msc-synthesis.
The reader is advised to view this demonstration as a way of understanding how the system functions and sounds in action.

Acknowledgements

Thank you to my supervisor Rob Clark for his input and assistance throughout the course of this project, to the staff and fellow students who have provided useful suggestions along the way, and to the friends who kindly donated their time to help me as part of the evaluation phase. Finally, a big thank you to all family, friends and loved ones who have supported me over the years. I really appreciate everything.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(David Abelman)

Table of Contents

1 Introduction 1
  1.1 Problem statement 1
  1.2 Background and context 2
  1.3 Motivation and potential applications 3

2 Previous work and literature 5
  2.1 HMM synthesis overview 5
  2.2 Performative HTS (pHTS) 7
  2.3 Expressive synthesis 9
    2.3.1 In unit selection 10
    2.3.2 In HMM-based synthesis 11
  2.4 Prosody of speech 12
    2.4.1 Definitions 12
    2.4.2 Prominence through pitch accents 14
    2.4.3 Contrastive emphasis 14
    2.4.4 Questions 15
  2.5 Gestures within speech 16
  2.6 Gesture recognition 17

3 Design 19
  3.1 Choices and rationales 19
    3.1.1 Pre-sentence control vs. live gestures 19
    3.1.2 Choice of prosodic effects 20
    3.1.3 Realisation of prosodic control 21
    3.1.4 Parameters to alter 22
    3.1.5 Choice of motion sensor as input 22
    3.1.6 Gesture recognition 23
    3.1.7 Choice of gestures 24
    3.1.8 Natural language model 25
  3.2 Preliminary tests to establish key parameters 26
    3.2.1 Contrastive emphasis: analysing pitch and duration shift of audio recordings 26
    3.2.2 Reverse engineering contrastive emphasis 29
    3.2.3 Other prosodic effects 30
    3.2.4 Beat gesture timing 31
    3.2.5 Beat gesture recognition 32
  3.3 Implementation 33
    3.3.1 Front end 34
    3.3.2 Back end 34
    3.3.3 Additional visual output 35
    3.3.4 Prosodic rules implemented 36
    3.3.5 Implementation issues and discussion 39
    3.3.6 Demonstration video 42

4 Areas to test 43

5 Experimental setup 49
  5.1 Generation test 49
    5.1.1 Setup summary 49
    5.1.2 Sentence design 50
    5.1.3 Pilot tests 51
  5.2 Listening test 51
    5.2.1 Setup summary 51
    5.2.2 Sentence design 52
    5.2.3 Pilot tests 52

6 Results and analysis 53
  6.1 Generation test 53
    6.1.1 Quantitative results 53
    6.1.2 Qualitative results 62
  6.2 Listening test 65

7 Discussion 77
  7.1 Summary and discussion of results 77
    7.1.1 Generation test 77
    7.1.2 Listening test 79
  7.2 In the context of the problem statement 81
  7.3 Critical review 82
  7.4 Future work 83

A Flowcharts describing gestural and prosodic rules implemented 85

B Generation and listening test sentences 91
  B.1 Generation test 91
  B.2 Listening test 92

Bibliography 95