Auxiliary Multimodal LSTM for Audio-visual Speech Recognition and Lipreading

Chunlin Tian, Weijun Ji
University of the Chinese Academy of Sciences, 19A Yuquanlu, Beijing, 100049, P.R. China
{tianchunlin15, jiweijun15}@mails.ucas.ac.cn

Abstract

Audio-visual Speech Recognition (AVSR), which employs both video and audio information for Automatic Speech Recognition (ASR), is one of the applications of multimodal learning that makes ASR systems more robust and accurate. Traditional models usually treated AVSR as inference or projection, but strict priors limit their ability. With the revival of deep learning, Deep Neural Networks (DNNs) have become an important toolkit in many classification tasks, including ASR, image classification and natural language processing. Some DNN models have been used in AVSR, such as Multimodal Deep Autoencoders (MDAEs), the Multimodal Deep Belief Network (MDBN) and the Multimodal Deep Boltzmann Machine (MDBM), and they indeed work better than traditional methods. However, such DNN models have several shortcomings: (1) they do not balance modal fusion and temporal fusion, or even lack temporal fusion altogether; (2) their architecture is not end-to-end, so training and testing become cumbersome. We propose a DNN model, the Auxiliary Multimodal LSTM (am-LSTM), to overcome these weaknesses. The am-LSTM can be trained and tested in a single pass, is easy to train, and prevents overfitting automatically. Extensibility and flexibility are also taken into consideration. The experiments show that am-LSTM is much better than traditional methods and other DNN models on three datasets: AVLetters, AVLetters2 and AVDigits.

1 Introduction

Automatic Speech Recognition (ASR) has been investigated for many years, and there is a wealth of literature. Recent progress includes Deep Speech 2 [3], which utilizes deep Convolutional Neural Networks (CNNs) [10], LSTM [9] and CTC [7], as well as sequence-to-sequence models [26]. Although ASR has achieved excellent results, it has some intrinsic problems, including insufficient tolerance of noise and disturbance. Besides, an illusion occurs when the auditory component of one sound is paired with the visual component of another sound, leading to the perception of a third sound [15]. This was described as the McGurk effect (McGurk & MacDonald, 1976).

Benefiting from multimodal learning, Audio-visual Speech Recognition (AVSR) is a supplement to ASR that mixes audio and visual information together, and a large amount of work has verified that AVSR strengthens ASR systems [17][2]. When it comes to AVSR, the core part is multimodal learning. In early decades, many models aimed to make the fused multimodal data more representative and discriminative; these included multimodal extensions of Hidden Markov Models (HMMs) and other statistical models. But strong prior assumptions limit such models. With the revival of neural networks, several deep models have been proposed for multimodal learning. Different from traditional models, deep models usually offer two main advantages:

(1) Many researchers think of Deep Neural Networks (DNNs) as performing a kind of good representation learning [5]; that is, the feature-extraction ability of DNNs can be exploited for a variety of tasks, especially CNNs for image feature extraction. For aural information, features are refined by deep models including the Deep Belief Network (DBN), the Deep Autoencoder (DAE), Restricted Boltzmann Machines (RBMs) and Deep Bottleneck Features (DBNF) [14][18][27].

(2) The other merit of deep models is well-performed fusion, conquering the biggest disadvantage of traditional methods.
Multimodal DBN (MDBN) [23], Multimodal DAEs (MDAEs) [17] and the Multimodal Deep Boltzmann Machine (MDBM) [24] are all unsupervised DNNs that perform fusion.

However, the aforementioned deep models have two primary shortcomings:

(1) They do not balance modal fusion and temporal fusion, or even lack temporal fusion altogether. Many methods simply concatenate the features of the frames in a single video to do temporal fusion. Besides, in the modal fusion, such methods do not consider the correlation among different frames.

(2) The architecture of these models is not end-to-end, so training and testing become cumbersome.

Inspired by Multimodal Recurrent Neural Networks [11] for image captioning, we propose an end-to-end DNN architecture, the Auxiliary Multimodal LSTM (am-LSTM), for AVSR and lipreading. The am-LSTM overcomes the two main weaknesses mentioned above. It is composed of LSTMs, a projection and a recognition component, and is trained in a single pass. Therefore the modal fusion and the temporal fusion are accomplished at the same time, i.e. the two fusion processes are mixed and combined in a balanced way. To avoid overfitting and make the DNN converge faster, we use a connection strategy similar to Deep Residual Learning [8] - the auxiliary connection. Early stopping [20] and dropout [25] regularization are used to fight overfitting as well. Because of the CNN and LSTM, am-LSTM is also an instance of the well-known CNN-RNN architecture.

We conducted experiments on three AVSR datasets: AVLetters (Patterson et al., 2002), AVLetters2 (Cox et al., 2008) and AVDigits (Di Hu et al., 2015). The results suggest that am-LSTM is better than classic AVSR models and previous deep models.

2 Related Work

Traditional AVSR systems. Humans understand the multimodal world in a seemingly effortless manner, although vast information processing resources are dedicated to the corresponding tasks by the brain [12]. In contrast, it is difficult for computers to understand multimodal information, because different modalities have different natures: the same information can be expressed in different modalities in very dissimilar ways, let alone distinct information. Audio is a one-dimensional temporal signal, video is a three-dimensional temporal signal, and text carries semantic information.

Multimodal fusion is the most important part of multimodal learning. There exist three kinds of fusion strategies: early fusion, late fusion and hybrid fusion. For early fusion, it suffices to concatenate all monomodal features into a single aggregate multimodal descriptor. In late fusion, each modality is classified independently; integration is done at the decision level and is usually based on heuristic rules. Hybrid fusion lies in between early and late fusion methods and is specifically geared towards modeling multimodal time-evolving data [12][4].

The representative early work in this field is multistream HMMs (mHMMs) [16], which were shown to be adaptable and flexible for modeling sequential and temporal data [21]. But probabilistic models have some explicit limitations, especially strong priors. Researchers are also interested in ways of jointly mapping data from dissimilar spaces into one space.

Deep Learning for AVSR. Deep learning provides a powerful toolkit for machine learning. In AVSR, DNNs also display an obvious increase in accuracy. [18] uses a pre-trained CNN to extract visual features and denoising autoencoders to improve aural features, and then mHMMs to do fusion and classification. As mentioned in Section 1, it mainly utilizes the feature-extraction ability of DNNs.

Some other methods perform multimodal fusion based on unsupervised DNNs. One type is based on DAEs; the representative work is MDAEs [17].
Others are based on Boltzmann Machines: [23] uses MDBN and [24] uses MDBM. Generally, DAE-based models are easy to train but lack theoretical support and flexibility. RBM-based models are hard to train because of the partition function [5], but they have sufficient support from probabilistic models and are simple to extend.

3 The Proposed Model

The am-LSTM aims at fusing the audio-visual data in one pass, considering the modal fusion, the temporal fusion and the connection between frames simultaneously. In this section, we introduce the am-LSTM and show its simplicity and extensibility.

3.1 Simple LSTM

There are two widely known issues with properly training a vanilla RNN: the vanishing and the exploding gradient [19]. Rather than relying on training tricks, the LSTM uses gates to avoid gradient vanishing and exploding. A typical LSTM has three gates - the input gate, the output gate and the forget gate - which slow down the disappearance of past information and make Backpropagation Through Time (BPTT) easier. The gates work as follows:

i_t = \mathrm{sigmoid}(W^i x_t + U^i h_{t-1} + b^i)   (1)
f_t = \mathrm{sigmoid}(W^f x_t + U^f h_{t-1} + b^f)   (2)
z_t = \tanh(W^z x_t + U^z h_{t-1} + b^z)   (3)
c_t = z_t i_t + c_{t-1} f_t   (4)
o_t = \mathrm{sigmoid}(W^o x_t + U^o h_{t-1} + b^o)   (5)
h_t = c_t \tanh(o_t)   (6)

Figure 1: Simple LSTM.

3.2 am-LSTM

The am-LSTM is an extension of the LSTM which contains two LSTMs and some additional components. The two LSTMs are the video LSTM and the audio LSTM; the fundamental architecture is the same as the simple LSTM, and the formulae of the video LSTM are (7)-(12) (similar formulae hold for the audio LSTM).

i^v_t = \mathrm{sigmoid}(W^i_v v_t + U^i_v h^v_{t-1} + b^i_v)   (7)
f^v_t = \mathrm{sigmoid}(W^f_v v_t + U^f_v h^v_{t-1} + b^f_v)   (8)
z^v_t = \tanh(W^z_v v_t + U^z_v h^v_{t-1} + b^z_v)   (9)
c^v_t = z^v_t i^v_t + c^v_{t-1} f^v_t   (10)
o^v_t = \mathrm{sigmoid}(W^o_v v_t + U^o_v h^v_{t-1} + b^o_v)   (11)
h^v_t = c^v_t \tanh(o^v_t)   (12)

Figure 2: am-LSTM.

After the video LSTM and the audio LSTM, the data are projected into the same dimensional space by a projection layer followed by an activation function. This also makes the model more nonlinear.

f_t = g(P^v v_t + P^a a_t)   (13)

P^v and P^a are projection matrices trained with the rest of the network, and g is the activation function:

g(x) = \tanh\left(\frac{2}{3}x\right)   (14)

The features obtained through the video LSTM, the audio LSTM and the projection can be regarded as well-fused features. The influence of the different frames in a single video is summed in both the video and the audio modality. Then a classical Multi-layer Perceptron (MLP) with batch normalization is used as the classifier. The training loss is the squared multi-label margin loss.

However, experiments showed that overfitting is very strong in the architecture described so far. Hence, we introduce the auxiliary connection, which accelerates convergence and prevents overfitting. The auxiliary connection lies after the video LSTM and the audio LSTM; the data are then summed and mapped to the target space. In training, three parts are taken into account: the main loss, the video auxiliary loss and the audio auxiliary loss.

The idea is to minimize the audio-visual loss, the video loss and the audio loss together. The auxiliary networks help the main network achieve a win-win situation. Like ResNet, the auxiliary networks convey information from the original networks, which makes am-LSTM consider rich and hierarchical information.

Figure 3: Batch and loss. The loss plateaus at about 0.2 because of strong overfitting.

Figure 4: The details of the auxiliary connection.

3.3 Training am-LSTM

To train our am-LSTM model, we adopt a combination of squared multi-label margin losses.

\mathrm{loss}(x, y) = \frac{1}{n}\sum_{i=1}^{n}\max\{0, (1 - x_i + y_i)^2\}   (15)

where loss(x, y) is the squared multi-label margin loss, and x and y are the prediction and the target.
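For concreteness, the following is a minimal PyTorch-style sketch of the squared multi-label margin loss of Eq. (15). The function name, the reading of the max as a hinge applied before squaring, and the mean reduction over the n classes are our assumptions for illustration; the paper does not publish its implementation.

import torch

def squared_multilabel_margin_loss(pred, target):
    """Squared multi-label margin loss in the spirit of Eq. (15).

    pred:   (batch, n) prediction scores x
    target: (batch, n) target scores y (e.g. a one-hot label vector)
    Assumed reading: hinge max{0, 1 - x_i + y_i}, squared, averaged over n.
    """
    hinge = torch.clamp(1.0 - pred + target, min=0.0)
    return (hinge ** 2).mean(dim=1)  # one loss value per sample

Each of the three network heads (main, video auxiliary, audio auxiliary) is trained with this same loss against the classification target, and the three terms are then combined as described next.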
The overall training loss is

\mathrm{Loss} = \mathrm{loss}_{\mathrm{main}} + \alpha\,\mathrm{loss}^{\mathrm{video}}_{\mathrm{auxiliary}} + \beta\,\mathrm{loss}^{\mathrm{audio}}_{\mathrm{auxiliary}} = \frac{1}{n}\sum_{i=1}^{n}\left[\max\{0, (1 - x_{\mathrm{main}} + y)^2\} + \alpha\max\{0, (1 - x^v_{\mathrm{aux}} + y)^2\} + \beta\max\{0, (1 - x^a_{\mathrm{aux}} + y)^2\}\right]   (16)

where Loss is the actual training loss of am-LSTM, loss_main is the loss of the main part, and loss^video_auxiliary and loss^audio_auxiliary are the losses of the auxiliary parts. α and β are hyperparameters controlling the influence of the auxiliary parts; here we choose α = β = 0.2, and y is the classification target. The auxiliary connection is free enough that the am-LSTM remains flexible and extensible.

4 Experiments

In this section, we describe our experimental setup, including the datasets, data pre-processing, implementation details and results, indicating that the am-LSTM is a robust, well-performing and flexible model for AVSR and likewise for lipreading.

4.1 Datasets

We conducted our experiments on three datasets: AVLetters (Patterson et al., 2002), AVLetters2 (Cox et al., 2008) and AVDigits (Di Hu et al., 2015).

Datasets     Speakers   Content   Times   Miscellaneous
AVLetters    10         A-Z       3       pre-extracted
AVLetters2   5          A-Z       7       previous split
AVDigits     6          0-9       9       /

Table 1: Details of the datasets.

4.2 Data Pre-processing

If the video and audio are not split beforehand, we split them into video and audio streams and make them the same length by truncation or completion. Finally, we center the data.

Video pre-processing. First, the Viola-Jones algorithm [28] is employed to extract the region of interest surrounding the mouth. After the region is resized to 224 x 224 pixels, a pre-trained VGG-16 [22] is used to extract image features; we use the features of the last fully connected layer. Finally, we reduce them to 100 principal components with PCA whitening.

Audio pre-processing. The audio features are extracted as a spectrogram with a 20 ms Hamming window and 10 ms overlap. With 251 points of the Fast Fourier Transform and 50 principal components from PCA, the spectral coefficient vector is regarded as the audio features.

Data augmentation. Four contiguous audio frames correspond to one video frame in each time step. We let the aural and visual features in every video simultaneously shift 10 frames up and down randomly to double the data. As a result, the model generalizes better in the time domain.

4.3 Implementation Details

The am-LSTM has two general LSTMs with dropout, mapping the 100-dimensional data to 50 dimensions. Since the video and audio features have the same dimension, the projection is an identity matrix and could be omitted. The MLP in the main phase has three layers with batch normalization, together with the ReLU activation function. The auxiliary phase has a simple structure mapping the 50-dimensional data to the 10 classes prepared for classification. As mentioned, α = β = 0.2, so the training loss is:

\mathrm{Loss} = \frac{1}{n}\sum_{i=1}^{n}\left[\max\{0, (1 - x_{\mathrm{main}} + y)^2\} + 0.2\max\{0, (1 - x^v_{\mathrm{aux}} + y)^2\} + 0.2\max\{0, (1 - x^a_{\mathrm{aux}} + y)^2\}\right]   (17)

The am-LSTM can be trained and tested in one pass; no redundant training process is needed. Lipreading is treated as cross-modality speech recognition, that is, in the training phase both the video and audio modalities are needed, whereas in the testing phase only the video modality is presented.
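To make the implementation details above concrete, the following is a minimal PyTorch-style sketch of the am-LSTM forward pass and the combined loss of Eq. (17). It follows the numbers reported in this section (100-dimensional inputs, 50-dimensional LSTMs, a three-layer MLP with batch normalization and ReLU, 10 output classes, α = β = 0.2); the exact layer widths of the MLP, the placement of dropout, and the use of a sum over time steps as the fused summary are our assumptions where the paper does not spell them out.

import torch
import torch.nn as nn

class AmLSTM(nn.Module):
    """Sketch of am-LSTM: a video LSTM and an audio LSTM, a projection/fusion
    step (Eqs. (13)-(14)), a main MLP classifier with batch normalization,
    and two auxiliary classifiers, one per modality."""

    def __init__(self, feat_dim=100, hidden=50, n_classes=10, p_drop=0.5):
        super().__init__()
        self.video_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.audio_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.dropout = nn.Dropout(p_drop)
        # Projections P^v and P^a of Eq. (13); with equal dimensions they
        # reduce to (near-)identity maps and could be omitted.
        self.proj_v = nn.Linear(hidden, hidden, bias=False)
        self.proj_a = nn.Linear(hidden, hidden, bias=False)
        # Main classifier: three-layer MLP with batch normalization and ReLU.
        self.mlp = nn.Sequential(
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes))
        # Auxiliary heads after each LSTM, mapping 50 dimensions to 10 classes.
        self.aux_v = nn.Linear(hidden, n_classes)
        self.aux_a = nn.Linear(hidden, n_classes)

    def forward(self, video, audio):
        # video, audio: (batch, time, feat_dim), aligned frame to frame
        hv, _ = self.video_lstm(video)
        ha, _ = self.audio_lstm(audio)
        hv, ha = self.dropout(hv), self.dropout(ha)
        # Eqs. (13)-(14): project, fuse, then sum the influence of all frames.
        f = torch.tanh((2.0 / 3.0) * (self.proj_v(hv) + self.proj_a(ha)))
        fused = f.sum(dim=1)
        return (self.mlp(fused),
                self.aux_v(hv.sum(dim=1)),
                self.aux_a(ha.sum(dim=1)))

def sq_margin(pred, target):
    # Squared margin term of Eqs. (15)-(17), hinge-then-square reading.
    return (torch.clamp(1.0 - pred + target, min=0.0) ** 2).mean()

def am_lstm_loss(main, aux_v, aux_a, target, alpha=0.2, beta=0.2):
    # Combined loss of Eq. (17): main loss plus weighted auxiliary losses.
    return (sq_margin(main, target)
            + alpha * sq_margin(aux_v, target)
            + beta * sq_margin(aux_a, target))

At test time only the main output is used for prediction; under this sketch, the auxiliary heads and their losses serve only to shape training, in the spirit of the auxiliary connection described in Section 3.2.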
4.4 Results

The evaluation of am-LSTM is split into two parts: the AVSR task and cross-modality lipreading. AVSR is trained and evaluated on both modalities, whereas cross-modality lipreading is trained on both video and audio but evaluated on video only. Our AVSR experiments are conducted on AVLetters2 and AVDigits; cross-modality lipreading is conducted on AVLetters.

4.4.1 Audio-visual Speech Recognition

AVSR is the main task of our work. We conducted the experiments on AVLetters2 and AVDigits; the comparison methods are MDBN, MDAEs and RTMRBM. The results indicate that am-LSTM is much better than these models.

Datasets     Model                          Mean Accuracy
AVLetters2   MDBN [23]                      54.10%
             MDAE [17]                      67.89%
             RTMRBM (Di Hu et al., 2015)    74.77%
             am-LSTM                        89.11%
AVDigits     MDBN [23]                      55.00%
             MDAE [17]                      66.74%
             RTMRBM (Di Hu et al., 2015)    71.77%
             am-LSTM                        85.23%

Table 2: AVSR performance on AVLetters2 and AVDigits. The results indicate that our model performs better than MDBN, MDAEs and RTMRBM.

4.4.2 Cross-Modality Lipreading

Cross-modality lipreading is the secondary task. As mentioned before, we trained am-LSTM on both modalities but evaluated it on the visual modality only. The experiments have two modes, video-only and cross-modality, which show the superiority of cross-modality lipreading. In the cross-modality lipreading experiments, am-LSTM performs much better than MDAEs, CRBM and RTMRBM. The lipreading here is word-level lipreading.

Mode             Model                              Mean Accuracy
Only Video       Multiscale Spatial Analysis [13]   44.60%
                 Local Binary Pattern [29]          58.85%
Cross Modality   MDAEs [17]                         64.21%
                 CRBM [2]                           65.78%
                 RTMRBM (Di Hu et al., 2015)        66.21%
                 am-LSTM                            88.83%

Table 3: Cross-modality lipreading performance. The experiments suggest that cross-modality lipreading is better than single-modality lipreading, and that our method performs much better than the other models.

5 Conclusion

We proposed an end-to-end deep model for AVSR and lipreading which clearly increases mean accuracy. Our experiments suggest that am-LSTM performs much better than other models in AVSR and cross-modality lipreading. The benefits of am-LSTM are that it is trained and tested in one pass, and that it is extensible and flexible; no additional training processes are needed when training am-LSTM. Because am-LSTM is simple, it can serve as a tool to fuse information efficiently. Meanwhile, am-LSTM considers temporal connections, so it is suitable for sequential features. In the future, we plan to apply am-LSTM to other multimodal temporal tasks and make it more flexible.

Acknowledgments

I would like to acknowledge my friends Yi Sun, Wenqiang Yang and Peng Xie for their helpful support. I would like to thank the developers of Torch [6] and Tensorflow [1]. I would also like to thank my laboratory for providing computational resources.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] Mohamed R Amer, Behjat Siddiquie, Saad Khan, Ajay Divakaran, and Harpreet Sawhney. Multimodal fusion using dynamic hybrid models. pages 556-563, 2014.
[3] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. arXiv preprint arXiv:1512.02595, 2015.
[4] Pradeep K Atrey, M Anwar Hossain, Abdulmotaleb El Saddik, and Mohan S Kankanhalli. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems, 16(6):345-379, 2010.
[5] Yoshua Bengio, Ian J Goodfellow, and Aaron Courville. Deep Learning. An MIT Press book in preparation. Draft chapters available at http://www.iro.umontreal.ca/~bengioy/dlbook, 2015.
[6] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
[7] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369-376. ACM, 2006.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[9] Sepp Hochreiter and Jürgen Schmidhuber.
Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[11] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632, 2014.
[12] Petros Maragos, Alexandros Potamianos, and Patrick Gros. Multimodal Processing and Interaction: Audio, Video, Text, volume 33. Springer Science & Business Media, 2008.
[13] Iain Matthews, Timothy F Cootes, J Andrew Bangham, Stephen Cox, and Richard Harvey. Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):198-213, 2002.
[14] Youssef Mroueh, Etienne Marcheret, and Vaibhava Goel. Deep multimodal learning for audio-visual speech recognition. pages 2130-2134, 2015.
[15] Audrey R Nath and Michael S Beauchamp. A neural basis for interindividual differences in the McGurk effect, a multisensory speech illusion. NeuroImage, 59(1):781-787, 2012.
[16] Ara V Nefian, Luhong Liang, Xiaobo Pi, Xiaoxing Liu, and Kevin Murphy. Dynamic Bayesian networks for audio-visual speech recognition. EURASIP Journal on Advances in Signal Processing, 2002(11):1-15, 2002.
[17] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 689-696, 2011.
[18] Kuniaki Noda, Yuki Yamaguchi, Kazuhiro Nakadai, Hiroshi G Okuno, and Tetsuya Ogata. Audio-visual speech recognition using deep learning. Applied Intelligence, 42(4):722-737, 2015.
[19] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. ICML (3), 28:1310-1318, 2013.
[20] Lutz Prechelt. Automatic early stopping using cross validation: quantifying the criteria. Neural Networks, 11(4):761-767, 1998.
[21] Lawrence R Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
[22] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[23] Nitish Srivastava and Ruslan Salakhutdinov. Learning representations for multimodal data with deep belief nets. In International Conference on Machine Learning Workshop, 2012.
[24] Nitish Srivastava and Ruslan R Salakhutdinov. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems, pages 2222-2230, 2012.
[25] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958, 2014.
[26] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112, 2014.
[27] Satoshi Tamura, Hiroshi Ninomiya, Norihide Kitaoka, Shin Osuga, Yurie Iribe, Kazuya Takeda, and Satoru Hayamizu. Audio-visual speech recognition using deep bottleneck features and high-performance lipreading. In 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pages 575-582. IEEE, 2015.
[28] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I-511. IEEE, 2001.
[29] Guoying Zhao, Mark Barnard, and Matti Pietikäinen. Lipreading with local spatiotemporal descriptors.
IEEE Transactions on Multimedia, 11(7):1254-1265, 2009.