Alignment of Speech and Co-speech Gesture in a Constraint-based Grammar

Katya Alahverdzhieva

Doctor of Philosophy
Institute for Language, Cognition and Computation
School of Informatics
University of Edinburgh
2013


Abstract

This thesis concerns the form-meaning mapping of multimodal communicative actions consisting of speech signals and improvised co-speech gestures, produced spontaneously with the hand. The interaction between speech and speech-accompanying gestures has standardly been addressed from a cognitive perspective, to establish the underlying cognitive mechanisms for the synchronous production of speech and gesture, and from a computational perspective, to build computer systems that communicate through multiple modalities.

Based on the findings of this previous research, we advance a new theory in which the mapping from the form of the combined speech-and-gesture signal to its meaning is analysed in a constraint-based multimodal grammar. We propose several construction rules about multimodal well-formedness that we motivate empirically from an extensive and detailed corpus study. In particular, the construction rules use the prosody, syntax and semantics of speech, the form and meaning of the gesture signal, and the temporal performance of the speech relative to the temporal performance of the gesture to constrain the derivation of a single multimodal syntax tree, which in turn determines a meaning representation via standard mechanisms for semantic composition. Gestural form often underspecifies its meaning, and so the output of our grammar is underspecified logical formulae that support the range of possible interpretations of the multimodal act in its final context-of-use, given current models of the semantics/pragmatics interface.

It is standardly held in the gesture community that the co-expressivity of speech and gesture is determined on the basis of their temporal co-occurrence: that is, a gesture signal is semantically related to the speech signal that happened at the same time as the gesture. Whereas this is usually taken for granted, we propose a methodology for establishing in a systematic and domain-independent way which spoken element(s) a gesture can be semantically related to, based on their form, so as to yield a meaning representation that supports the intended interpretation(s) in context. The ‘semantic’ alignment of speech and gesture is thus driven not by temporal co-occurrence alone, but also by the linguistic properties of the speech signal the gesture overlaps with. In so doing, we contribute a fine-grained system for articulating the form-meaning mapping of multimodal actions that uses standard methods from linguistics.

We show that just as language exhibits ambiguity in both form and meaning, so do multimodal actions: for instance, the integration of gesture is not restricted to a unique speech phrase; rather, speech and gesture can be aligned in multiple multimodal syntax trees, thus yielding distinct meaning representations. These multiple mappings stem from the fact that the meaning derived from gesture form is highly incomplete even in context. An overall challenge is thus to account for the range of possible interpretations of the multimodal action in context using standard methods from linguistics for syntactic derivation and semantic composition.


Acknowledgements

Only with the help of many people was I able to complete this journey. First and foremost, I would like to thank my supervisor, Professor Alex Lascarides, who mentored me on this rather unconventional topic, and who guided and supported me throughout the years.
For her patience, brilliant ideas and immediate responses, I feel lucky, grateful and privileged to have worked with Alex.

I am deeply grateful to Daniel Loehr, who provided me with his collection of annotated video recordings, and to Michael Kipp, who kindly provided me with the labelling tool Anvil and offered technical assistance whenever necessary. I would also like to thank the participants who did the gesture annotation in my experiment. I gratefully acknowledge Mark Steedman and Ewan Klein, who gave me extremely helpful comments and ideas that I used in this thesis. Big thanks also to Jean Carletta and Jonathan Kilgour, who helped me a lot with the NXT tool. I also benefited greatly from email discussions with Sasha Calhoun. In 2010, I participated in the first summer school in Gesture Studies, where I was lucky to meet and talk about my work with Adam Kendon, Mandana Seyfeddinipur, and the summer school participants.

I gratefully acknowledge Dan Flickinger, who made the initially impossible task of implementing a multimodal grammar possible; it is thanks to him that the implemented grammar came to life. Big thanks to Emily Bender for her help with the grammar engineering challenge, to the people from the DELPH-IN community (Ulrich Schäfer, Peter Adolphs, Berthold Crysmann), who gave me useful instructions on how to handle the grammar engineering platforms, and to Stephan Oepen, who helped me with the grammar profiling system. I would also like to express my gratitude to Nicholas Asher, who offered me great help with SDRT, and to the anonymous reviewers of my submissions for their insightful comments. Thanks to my office mates, who contributed to the nice, friendly and quiet environment in our office: Yansong, Jeff, Neil, Sharon, Ioannis and Sean. All images should be attributed to Tudor Thomas, who did excellent work in turning my screenshots into these lively drawings.

I am deeply grateful to my examiners, Professor Bob Ladd and Dr. Michael Johnston, for their insightful comments and criticism, and also for turning my viva into a very relaxed and enjoyable experience.

Many thanks are due to the EPSRC, which provided the funding for my PhD studies, and also to the JAMES project, which funded my attendance at two conferences.

And of course, thanks to my mum, who always encouraged me in my pursuits and never questioned my choice to live far from home; to my sister, who showed me that a PhD is achievable; and to Hervé Saint-Amand, for being by my side and for his unconditional love and support.


Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Katya Alahverdzhieva)


Table of Contents

1 Introduction
  1.1 What This Thesis is About
    1.1.1 Main Properties of Co-speech Gesture
    1.1.2 Thesis Aims
    1.1.3 Steps to Achieve Our Aims
  1.2 What This Thesis is Not About
  1.3 Why a Multimodal Grammar?
    1.3.1 Assumptions About the Model of Pragmatic Theory
    1.3.2 Empirical Evidence
    1.3.3 Cognition: an Inseparable System
  1.4 Speech-Gesture Alignment
  1.5 This Thesis in Context
  1.6 Thesis Overview
  1.7 Published Work

2 Data
  2.1 Gesture Background
    2.1.1 Gesture Dimensions
    2.1.2 Structural Organisation of Gesture
  2.2 Main Challenges
    2.2.1 Range of Ambiguity
    2.2.2 Not a Free-for-All
  2.3 Summary and Next Steps

3 Related Literature
  3.1 First Accounts: Historical Notes
  3.2 Speech-Gesture Integration: a descriptive account
    3.2.1 Integrated Message of Spoken and Gestural Material
    3.2.2 Gestures as a Global Performance Dependent on Context
    3.2.3 Relationship between Gesture and Intonation
    3.2.4 Relationship between Gesture and Syntactic Constituency
    3.2.5 Multimodal Timing
    3.2.6 Summary
  3.3 Gesture from a Cognitive Perspective
    3.3.1 Gesture as a Product of Cognitive Processes
    3.3.2 Communicative Intentionality of Gestures
    3.3.3 Summary
  3.4 Computational Models of Gesture
    3.4.1 Multimodal Parsing
    3.4.2 Multimodal Generation
    3.4.3 Summary
  3.5 Existing Formal Models of Multimodal Syntax
    3.5.1 An HPSG-based Integration of Speech and Deixis
    3.5.2 A Multimodal Grammar for German
    3.5.3 Integration of Speech and Gesture in Unification-based Grammars
    3.5.4 Summary
  3.6 Conclusions

4 Empirical Investigation
  4.1 Corpora and Annotation
    4.1.1 Corpus of Loehr [2004]
    4.1.2 Talkbank and AMI Corpora
    4.1.3 Multimodal Corpora in NXT
  4.2 Depicting Gestures
    4.2.1 Aim and Method
    4.2.2 Results and Discussion
  4.3 Deictic Gestures
    4.3.1 Aim and Method
    4.3.2 Results and Discussion
  4.4 Beats
  4.5 Conclusions and Next Steps