Sequence Modeling using Gated Recurrent Neural Networks

Mohammad Pezeshki
Department of Computer Science and Operations Research
Universite de Montreal, Montreal QC H3C 3J7
[email protected]

Abstract

In this paper, we use Recurrent Neural Networks to capture and model human motion data and to generate motions by predicting the next immediate data point at each time-step. Our RNN is equipped with recently proposed Gated Recurrent Units, which have shown promising results in sequence modeling problems such as Machine Translation and Speech Synthesis. We demonstrate that this model is able to capture long-term dependencies in data and to generate realistic motions.

1 Introduction

Sequence modeling has been a challenging problem in Machine Learning that requires models able to capture temporal dependencies. One of the early models for sequence modeling was the Hidden Markov Model (HMM) [1]. HMMs capture the data distribution using multinomial latent variables: each data point at time t is conditioned on the hidden state at time t, and the hidden state at time t is conditioned on the hidden state at time t-1. In HMMs, both P(x_t | s_t) and P(s_t | s_{t-1}) are the same for all time-steps. A similar idea of parameter sharing is used in the Recurrent Neural Network (RNN) [2]. RNNs are an extension of feedforward neural networks whose weights are shared across every time-step of the data. Consequently, we can apply RNNs to sequential input data.

Theoretically, RNNs are capable of capturing sequences of arbitrary complexity. Unfortunately, as shown by Bengio et al. [3], there are difficulties in training RNNs on sequences with long-term dependencies. Among the many solutions proposed for RNN training problems over the past few decades, we use Gated Recurrent Units, recently proposed by Cho et al. [4]. As shown by Chung et al. [5], the Gated Recurrent Unit performs considerably better than conventional tanh units.

In the following sections, we introduce the model, train it on the MIT motion database [6], and show that it is capable of capturing the complexities of human body motions. We then demonstrate that we are able to generate sequences of motions by predicting the next immediate data point given all previous data points.

2 Recurrent Neural Network

The simple Recurrent Neural Network, which has been shown to be able to implement a Turing machine [7], is an extension of feedforward neural networks. The idea in RNNs is that they share parameters across different time-steps. This idea, called parameter sharing, enables RNNs to be used for sequential data. RNNs have memory and can memorize input values for some period of time. More formally, if the input sequence is x = {x_1, x_2, ..., x_N}, then each hidden state is a function of the current input and the previous hidden state:

h_t = F_\theta(h_{t-1}, x_t)

where F_\theta is a linear transformation followed by a non-linearity:

h_t = H(W_{hh} h_{t-1} + W_{xh} x_t)

where H is a non-linear function, which in the vanilla RNN is the conventional tanh. Using the above equation, it can easily be shown that each hidden state h_t is a function of all previous inputs:

h_t = G_\theta(x_1, x_2, ..., x_t)

where G_\theta is a highly complicated, non-linear function that summarizes all previous inputs in h_t. A trained G_\theta puts more emphasis on some aspects of some of the previous inputs, trying to minimize the overall cost of the network. Finally, in the case of real-valued outputs, the output at time-step t can be computed as follows:

y_t = W_{hy} h_t

Note that bias vectors are omitted to keep the notation simple. A graphical illustration of RNNs is shown in figure 1.

Figure 1: Left: A Recurrent Neural Network with recurrent connections from the hidden units to themselves. Right: The same network, unfolded in time. Note that the weight matrices are the same for every time-step.
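To make the recurrence above concrete, the following is a minimal NumPy sketch of the vanilla RNN forward pass described in this section: a tanh hidden-to-hidden recurrence followed by a linear read-out, with biases omitted as in the equations. The function name rnn_forward, the dimensions, and the random initialization are illustrative assumptions, not part of the paper's implementation.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, h0):
    """Vanilla RNN: h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t (biases omitted)."""
    h = h0
    ys = []
    for x_t in x_seq:                       # iterate over time-steps
        h = np.tanh(W_hh @ h + W_xh @ x_t)  # hidden state depends on previous state and current input
        ys.append(W_hy @ h)                 # real-valued output at each time-step
    return np.stack(ys)

# Illustrative sizes: 2 input features, 7 hidden units, 1 output (as in the toy task of section 3.1).
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 2, 7, 1
W_xh = rng.normal(scale=0.1, size=(n_hid, n_in))
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))
W_hy = rng.normal(scale=0.1, size=(n_out, n_hid))
x_seq = rng.normal(size=(20, n_in))         # a 20-step input sequence
y_seq = rnn_forward(x_seq, W_xh, W_hh, W_hy, np.zeros(n_hid))
print(y_seq.shape)                          # (20, 1)
```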
2.1 Generative Recurrent Neural Network

We can use a Recurrent Neural Network as a generative model such that the output of the network at time-step t-1 defines a probability distribution over the next input at time-step t. According to the chain rule, we can write the joint probability distribution over the input sequence as follows:

P(x_1, x_2, ..., x_N) = P(x_1) P(x_2 | x_1) ... P(x_N | x_1, ..., x_{N-1})

Now we can model each of these conditional probability distributions as a function of the hidden states:

P(x_t | x_1, ..., x_{t-1}) = f(h_t)

Obviously, since h_t is a fixed-length vector and {x_1, ..., x_{t-1}} is a variable-length sequence, h_t can be considered a lossy compression of the history. During the learning process, the network should learn to keep important information (according to the cost function) and throw away useless information. Thus, in practice the network only looks some number of time-steps back, to x_{t-k}. The architecture of a Generative Recurrent Neural Network is shown in figure 2.

Unfortunately, as shown by Bengio et al. [3], there are optimization issues when we try to train such models on data with long-term dependencies. The problem is that when an error occurs, as we back-propagate it through time to update the parameters, the gradient may decay exponentially to zero (vanishing gradients) or grow exponentially large. For the problem of huge gradients, an ad-hoc solution is to restrict the gradient so that it does not exceed a threshold; this technique is known as gradient clipping. The solution for vanishing gradients, however, is not trivial. Over the past few decades, several methods have been proposed [e.g. 8, 9, 10] to tackle this problem. Although the problem still remains, gating methods have shown promising results in comparison with the vanilla RNN on different tasks such as Speech Recognition [11], Machine Translation [12], and Image Caption Generation [13]. One of the models which exploits a gating mechanism is the Gated Recurrent Unit [4].

Figure 2: An unfolded Generative Recurrent Neural Network whose output at time-step t-1 defines a conditional probability distribution over the next input. Dashed lines are active during the generation phase.

2.2 Gated Recurrent Unit

The Gated Recurrent Unit (GRU) differs from the simple RNN in that each GRU hidden unit has two gates. These gates, called the update and reset gates, control the flow of information inside each hidden unit. Each hidden state at time-step t is computed as follows:

h_t = (1 - z_t) ∘ h_{t-1} + z_t ∘ \tilde{h}_t

where ∘ is an element-wise product, z_t is the update gate, and \tilde{h}_t is the candidate activation:

\tilde{h}_t = tanh(W_{xh} x_t + W_{hh} (r_t ∘ h_{t-1}))

where r_t is the reset gate. Both the update and reset gates are computed using a sigmoid function:

z_t = σ(W_{xz} x_t + W_{hz} h_{t-1})
r_t = σ(W_{xr} x_t + W_{hr} h_{t-1})

where the W's are the weight matrices of the two gates. A Gated Recurrent Unit is shown in figure 3.

Figure 3: A Gated Recurrent Unit, in which r and z are the reset and update gates, and h and \tilde{h} are the activation and the candidate activation. [5]
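As an illustration of the GRU equations above, the following is a minimal NumPy sketch of a single GRU step. The helper name gru_step, the weight shapes, and the random initialization are illustrative assumptions; biases are again omitted to match the notation of section 2.2.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_xz, W_hz, W_xr, W_hr, W_xh, W_hh):
    """One GRU step following section 2.2 (biases omitted)."""
    z_t = sigmoid(W_xz @ x_t + W_hz @ h_prev)               # update gate
    r_t = sigmoid(W_xr @ x_t + W_hr @ h_prev)               # reset gate
    h_tilde = np.tanh(W_xh @ x_t + W_hh @ (r_t * h_prev))   # candidate activation
    return (1.0 - z_t) * h_prev + z_t * h_tilde             # element-wise interpolation

# Illustrative sizes: 2 inputs and 7 hidden units (as in the toy task of section 3.1).
rng = np.random.default_rng(0)
n_in, n_hid = 2, 7
Ws = [rng.normal(scale=0.1, size=s) for s in
      [(n_hid, n_in), (n_hid, n_hid)] * 3]                  # W_xz, W_hz, W_xr, W_hr, W_xh, W_hh
h = np.zeros(n_hid)
for x_t in rng.normal(size=(20, n_in)):                     # run the unit over a 20-step sequence
    h = gru_step(x_t, h, *Ws)
print(h.shape)                                              # (7,)
```

The update gate z_t interpolates between keeping the previous state and adopting the candidate activation, which is what allows information to be carried over many time-steps.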
3 Experimental results

In this section we describe the motion dataset and the results of modeling and generating human motions.

3.1 Ability to capture long-term dependencies

Before training the model on motion data, let us first compare the GRU with conventional tanh units. As we discussed in section 2, due to optimization problems, simple RNNs are not able to capture long-term dependencies in data (the vanishing gradient problem). Thus, instead of using a tanh activation function, we use the GRU. Here we try to show that the GRU performs considerably better. The task is to read a sequence of random numbers, memorize them for some period of time, and then emit an output that is a sum over past input values. We generated 100 different sequences, each containing 20 rows (time-steps) and 2 columns (attributes of each data point). We trained the models such that the output at time t (y_t) is a function of previous input values:

y_t = x_{t-3}[0] + x_{t-5}[1]

Hence, we expect the models to memorize up to 5 time-steps back and to learn when to use which dimension of the previous inputs. For both models, the input is a vector of size 2, the output is a scalar value, and the single hidden layer has 7 units. We allowed both networks to overfit on the training data. Figure 4 shows that the model with GRUs performs very well, while the simple tanh model cannot capture the dependency.

Figure 4: The first two graphs show the two input signals of the model for 20 time-steps. The third and fourth graphs are associated with the tanh and GRU models respectively. The solid line is the real target and the dashed line is the model output. As is clear, the model in the fourth graph performs considerably better and is able to model the target signal very well.

3.2 Dataset

Among the available Motion Capture (MOCAP) datasets, we used a simple walking motion from the MIT Motion dataset [6]. The dataset was generated by filming a man wearing clothing fitted with 17 small lights that determine the positions of the body joints. Each data point in our dataset consists of information about global orientation and displacement. To be able to generate more realistic motions, we used the same preprocessing as Taylor et al. [14]. Our final dataset contains 375 rows, where each row contains 49 ground-invariant, zero-mean, and unit-variance features of the body joints during walking. We also used Neil Lawrence's motion capture toolbox to visualize the data in 3D space. Samples of the data are shown in figure 5.

Figure 5: A sequence of frames of walking from the MIT motion dataset. Each frame has 17 light points on the body. [6]

3.3 Motion generation

We trained our GRU Recurrent Neural Network, which has 49 input units and 120 hidden units in a single hidden layer. We then use it in a generative fashion in which each output at time t is fed back to the model as the input x_{t+1}. To initialize the model, we first feed it 50 frames of the training data and then let the model generate a sequence of arbitrary length. The quality of the generated motion is good enough that it cannot be distinguished from real training data by the naked eye. In figure 6, the average over all 49 features is plotted for better visualization. The initialization and generation phases are shown in figure 7.

Figure 6: The average (for visualization) of all 49 features is shown as a solid line for the real target and as a dashed line for the generated data.
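To make the generation procedure of section 3.3 concrete, the following is a minimal sketch of the initialize-then-sample loop: the network is primed on real frames and each predicted frame is then fed back as the next input. The names generate_motion, step_fn, and readout are illustrative assumptions; in our setting step_fn would be the GRU step sketched in section 2.2 composed with the trained weights, and readout a linear map to the 49 motion features. The dummy tanh step below is only a stand-in so the sketch runs end to end.

```python
import numpy as np

def generate_motion(step_fn, readout, prime_frames, n_generate, h0):
    """Prime the network on real frames, then feed each predicted frame back in as the next input."""
    h = h0
    for x_t in prime_frames:           # initialization phase: e.g. 50 real frames of training data
        h = step_fn(x_t, h)
    x_t = readout(h)                   # first predicted frame after the primer
    generated = [x_t]
    for _ in range(n_generate - 1):    # generation phase: output at time t becomes input x_{t+1}
        h = step_fn(x_t, h)
        x_t = readout(h)               # predicted next frame (49 motion features)
        generated.append(x_t)
    return np.stack(generated)

# Example wiring with illustrative shapes and a dummy recurrent step (stand-in for the trained GRU).
rng = np.random.default_rng(0)
n_feat, n_hid = 49, 120
W, U, V = (rng.normal(scale=0.05, size=s) for s in [(n_hid, n_feat), (n_hid, n_hid), (n_feat, n_hid)])
step = lambda x, h: np.tanh(W @ x + U @ h)
motion = generate_motion(step, lambda h: V @ h, rng.normal(size=(50, n_feat)), 100, np.zeros(n_hid))
print(motion.shape)                    # (100, 49)
```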
4 Conclusion

In this paper we have demonstrated that the Gated Recurrent Unit helps with the optimization problems of Recurrent Neural Networks when there are long-term dependencies in the data. We ran our experiments discriminatively on a toy dataset and generatively on the MIT motion dataset, and showed that the GRU performs much better than a simple Recurrent Neural Network with the conventional tanh activation function on both the memorization and generation tasks.

Figure 7: This graph shows the initialization using real data for the first 50 time-steps as a solid blue line; the rest is the generated sequence, shown as a green dashed line.

References

[1] Rabiner, Lawrence, and Biing-Hwang Juang. "An introduction to hidden Markov models." ASSP Magazine, IEEE 3.1 (1986): 4-16.

[2] Rumelhart, D. E., G. E. Hinton, and R. J. Williams. "Learning Internal Representations by Error Propagation." Parallel Distributed Processing: Explorations in the Microstructure of Cognition, ed. D. E. Rumelhart and J. McClelland. Vol. 1. 1986.

[3] Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. "Learning long-term dependencies with gradient descent is difficult." Neural Networks, IEEE Transactions on 5.2 (1994): 157-166.

[4] Cho, K., B. van Merrienboer, D. Bahdanau, and Y. Bengio. "On the properties of neural machine translation: Encoder-decoder approaches." arXiv preprint arXiv:1409.1259 (2014).

[5] Chung, Junyoung, et al. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv preprint arXiv:1412.3555 (2014).

[6] Hsu, Eugene, Kari Pulli, and Jovan Popovic. "Style translation for human motion." ACM Transactions on Graphics (TOG) 24.3 (2005): 1082-1089.

[7] Siegelmann, Hava T., and Eduardo D. Sontag. "Turing computability with neural nets." Applied Mathematics Letters 4.6 (1991): 77-80.

[8] Martens, James, and Ilya Sutskever. "Learning recurrent neural networks with hessian-free optimization." Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.

[9] Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. "On the difficulty of training recurrent neural networks." arXiv preprint arXiv:1211.5063 (2012).

[10] Bengio, Yoshua, Nicolas Boulanger-Lewandowski, and Razvan Pascanu. "Advances in optimizing recurrent networks." Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.

[11] Graves, Alex, and Navdeep Jaitly. "Towards end-to-end speech recognition with recurrent neural networks." Proceedings of the 31st International Conference on Machine Learning (ICML-14). 2014.

[12] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).

[13] Vinyals, Oriol, et al. "Show and Tell: A Neural Image Caption Generator." arXiv preprint arXiv:1411.4555 (2014).

[14] Taylor, Graham W., Geoffrey E. Hinton, and Sam T. Roweis. "Modeling human motion using binary latent variables." Advances in Neural Information Processing Systems. 2006.