Auxiliary Multimodal LSTM for Audio-visual Speech Recognition and Lipreading

Chunlin Tian, Weijun Ji
University of the Chinese Academy of Sciences, 19A Yuquanlu, Beijing, 100049, P.R. China
{tianchunlin15, jiweijun15}@mails.ucas.ac.cn

arXiv:1701.04224v2 [cs.CV] 17 Mar 2017

Abstract

Audio-visual Speech Recognition (AVSR), which employs both video and audio information for Automatic Speech Recognition (ASR), is an application of multimodal learning that makes ASR systems more robust and accurate. Traditional models usually treated AVSR as inference or projection, but strict priors limit their ability. With the revival of deep learning, Deep Neural Networks (DNNs) have become an important toolkit for many classification tasks, including ASR, image classification and natural language processing. Several DNN models have been applied to AVSR, such as Multimodal Deep Autoencoders (MDAEs), the Multimodal Deep Belief Network (MDBN) and the Multimodal Deep Boltzmann Machine (MDBM), and they indeed work better than traditional methods. However, such DNN models have two main shortcomings: (1) they do not balance modal fusion and temporal fusion, or even lack temporal fusion altogether; (2) their architecture is not end-to-end, so training and testing become cumbersome. We propose a DNN model, the Auxiliary Multimodal LSTM (am-LSTM), to overcome these weaknesses. The am-LSTM can be trained and tested in a single pass, is easy to train and prevents overfitting automatically. Extensibility and flexibility are also taken into consideration. Experiments show that am-LSTM performs much better than traditional methods and other DNN models on three datasets: AVLetters, AVLetters2 and AVDigits.

1 Introduction

Automatic Speech Recognition (ASR) has been investigated for many years, and there is a wealth of literature. Recent progress includes Deep Speech 2 [3], which combines deep Convolutional Neural Networks (CNNs) [10], LSTM [9] and CTC [7], as well as sequence-to-sequence models [26]. Although ASR has achieved excellent results, it has some intrinsic problems, including insufficient tolerance of noise and disturbance. Besides, an illusion occurs when the auditory component of one sound is paired with the visual component of another sound, leading to the perception of a third sound [15]. This is known as the McGurk effect (McGurk & MacDonald, 1976).

Benefiting from multimodal learning, Audio-visual Speech Recognition (AVSR) supplements ASR by mixing audio and visual information together, and a large body of work has verified that AVSR strengthens ASR systems [17][2]. The core part of AVSR is multimodal learning. In earlier decades, many models aimed to fuse the multimodal data in a more representative and discriminative way; these include multimodal extensions of Hidden Markov Models (HMMs) and other statistical models, but strong prior assumptions limit them. With the revival of neural networks, several deep models have been proposed for multimodal learning. Different from traditional models, deep models usually offer two main advantages:

(1) Many researchers regard Deep Neural Networks (DNNs) as performing a kind of good representation learning [5]; that is, the feature-extraction ability of DNNs can be exploited for a variety of tasks, especially CNNs for image feature extraction. Aural information is refined by deep models including the Deep Belief Network (DBN), the Deep Autoencoder (DAE), Restricted Boltzmann Machines (RBMs) and Deep Bottleneck Features (DBNF) [14][18][27].
(2) The other merit of deep models is well-performed fusion, which overcomes the biggest disadvantage of traditional methods. The Multimodal DBN (MDBN) [23], Multimodal DAEs (MDAEs) [17] and the Multimodal Deep Boltzmann Machine (MDBM) [24] are all unsupervised DNNs that perform fusion.

However, the aforementioned deep models have two primary shortcomings:

(1) They do not balance modal fusion and temporal fusion, or even lack temporal fusion altogether. Many methods simply concatenate the features of the frames in a single video to perform temporal fusion; moreover, in modal fusion, such methods do not consider the correlation among different frames.

(2) The architecture of these models is not end-to-end, so training and testing become cumbersome.

Inspired by Multimodal Recurrent Neural Networks [11] for image captioning, we propose an end-to-end DNN architecture, the Auxiliary Multimodal LSTM (am-LSTM), for AVSR and lipreading. The am-LSTM overcomes the two main weaknesses mentioned above. It is composed of LSTMs, a projection stage and a recognition stage, and is trained in a single pass. Modal fusion and temporal fusion are therefore accomplished at the same time, i.e. the two fusion processes are mixed and combined in a balanced way. To avoid overfitting and make the network converge faster, we use a connection strategy similar to Deep Residual Learning [8], which we call the auxiliary connection. Early stopping [20] and dropout [25] regularization are used to fight overfitting as well. Because it combines a CNN feature extractor with LSTMs, am-LSTM is also an instance of the well-known CNN-RNN architecture.

We conducted experiments on three AVSR datasets: AVLetters (Patterson et al., 2002), AVLetters2 (Cox et al., 2008) and AVDigits (Di Hu et al., 2015). The results suggest that am-LSTM is better than classic AVSR models and previous deep models.

2 Related work

Traditional AVSR systems. Humans understand the multimodal world in a seemingly effortless manner, although the brain dedicates vast information-processing resources to the corresponding tasks [12]. It is difficult for computers to understand multimodal information, because different modalities differ in nature: the same information can be expressed in very dissimilar ways in different modalities, let alone distinct information. Audio is a one-dimensional temporal signal, video is a three-dimensional temporal signal, and text carries semantic information.

Multimodal fusion is the most important part of multimodal learning. There are three kinds of fusion strategies: early fusion, late fusion and hybrid fusion. For early fusion, it suffices to concatenate all monomodal features into a single aggregate multimodal descriptor. In late fusion, each modality is classified independently; integration is done at the decision level and is usually based on heuristic rules. Hybrid fusion lies in between early and late fusion and is specifically geared towards modeling multimodal time-evolving data [12][4].
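To make the distinction concrete, the following Python sketch contrasts early and late fusion on hypothetical per-frame audio and video feature vectors. The feature dimensions, the stand-in classifier and the equal decision weights are illustrative assumptions, not part of any system described in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame monomodal features (dimensions are illustrative only).
audio_feat = rng.standard_normal(50)    # e.g. PCA-reduced spectral coefficients
video_feat = rng.standard_normal(100)   # e.g. PCA-reduced CNN features of the mouth region

# Early fusion: concatenate the monomodal features into a single aggregate
# multimodal descriptor and train one classifier on the joint vector.
early_fused = np.concatenate([audio_feat, video_feat])     # shape (150,)

def toy_classifier(x, n_classes=26):
    """Stand-in for a trained per-modality classifier: a random linear map
    followed by a softmax over the class scores."""
    logits = rng.standard_normal((n_classes, x.shape[0])) @ x
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Late fusion: classify each modality independently and integrate at the
# decision level, here with a simple heuristic weighted-average rule.
p_audio = toy_classifier(audio_feat)
p_video = toy_classifier(video_feat)
late_fused = 0.5 * p_audio + 0.5 * p_video                 # decision-level integration
```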
The representative early work in this field is multistream HMMs (mHMMs) [16], which were shown to be adaptable and flexible for modeling sequential and temporal data [21]. But probabilistic models have explicit limitations, especially their strong priors. Researchers have also been interested in mapping data that live in dissimilar spaces into one joint space.

Deep learning for AVSR. Deep learning provides a powerful toolkit for machine learning, and in AVSR, DNNs also yield a clear increase in accuracy. [18] uses a pre-trained CNN to extract visual features and denoising autoencoders to improve aural features, and then mHMMs for fusion and classification; as mentioned in Section 1, it mainly exploits the feature-extraction ability of DNNs. Other methods perform multimodal fusion with unsupervised DNNs. One family is based on DAEs, with MDAEs [17] as the representative work. Others are based on Boltzmann Machines: [23] uses the MDBN and [24] uses the MDBM. Generally, DAE-based models are easy to train but lack theoretical support and flexibility, while RBM-based models are hard to train because of the partition function [5] but enjoy solid support from probabilistic modeling and are simple to extend.

3 The Proposed Model

The am-LSTM aims at fusing the audio-visual data in one pass, considering modal fusion, temporal fusion and the connection between frames simultaneously. In this section, we introduce the am-LSTM and show its simplicity and extensibility.

3.1 Simple LSTM

There are two widely known issues with properly training a vanilla RNN: the vanishing and the exploding gradient [19]. Instead of relying on training tricks, the LSTM uses gates to avoid gradient vanishing and exploding. A typical LSTM has three gates, an input gate, an output gate and a forget gate, which slow down the disappearance of past information and make Backpropagation Through Time (BPTT) easier. The gates work as follows:

i_t = sigmoid(W^i x_t + U^i h_{t-1} + b^i)   (1)
f_t = sigmoid(W^f x_t + U^f h_{t-1} + b^f)   (2)
z_t = tanh(W^z x_t + U^z h_{t-1} + b^z)      (3)
c_t = z_t ⊙ i_t + c_{t-1} ⊙ f_t              (4)
o_t = sigmoid(W^o x_t + U^o h_{t-1} + b^o)   (5)
h_t = c_t ⊙ tanh(o_t)                        (6)

Figure 1: Simple LSTM.
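As a concrete reading of Eqs. (1)-(6), the following minimal NumPy sketch performs one LSTM step. The per-gate parameter layout is an assumption made for readability, and the last line follows Eq. (6) exactly as printed, which differs from the more common formulation h_t = o_t ⊙ tanh(c_t).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One step of the simple LSTM in Eqs. (1)-(6).

    W, U, b are dicts holding the per-gate parameters (keys 'i', 'f', 'z', 'o');
    W[k] has shape (hidden, input), U[k] (hidden, hidden), b[k] (hidden,).
    """
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate,  Eq. (1)
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate, Eq. (2)
    z_t = np.tanh(W['z'] @ x_t + U['z'] @ h_prev + b['z'])   # candidate,   Eq. (3)
    c_t = z_t * i_t + c_prev * f_t                           # cell state,  Eq. (4)
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate, Eq. (5)
    h_t = c_t * np.tanh(o_t)                                 # Eq. (6) as printed in the paper
    return h_t, c_t
```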
3.2 am-LSTM

The am-LSTM is an extension of the LSTM that contains two LSTMs and several additional components. The two LSTMs are a video LSTM and an audio LSTM whose fundamental architecture is the same as the simple LSTM; the formulae of the video LSTM are (7)-(12), and analogous formulae hold for the audio LSTM.

i^v_t = sigmoid(W^i_v v_t + U^i_v h^v_{t-1} + b^i_v)   (7)
f^v_t = sigmoid(W^f_v v_t + U^f_v h^v_{t-1} + b^f_v)   (8)
z^v_t = tanh(W^z_v v_t + U^z_v h^v_{t-1} + b^z_v)      (9)
c^v_t = z^v_t ⊙ i^v_t + c^v_{t-1} ⊙ f^v_t              (10)
o^v_t = sigmoid(W^o_v v_t + U^o_v h^v_{t-1} + b^o_v)   (11)
h^v_t = c^v_t ⊙ tanh(o^v_t)                            (12)

Figure 2: am-LSTM.

After the video LSTM and the audio LSTM, the data are projected into a space of the same dimension by a projection layer followed by an activation function, which also makes the model more nonlinear:

f_t = g(P^v v_t + P^a a_t)   (13)

where P^v and P^a are projection matrices trained jointly with the network and g is the activation function

g(x) = tanh((2/3) x)   (14)

The features produced by the video LSTM, the audio LSTM and the projection can be regarded as well-fused features. The influence of the different frames of a single video is summed in both the video and the audio modality. A classical Multi-Layer Perceptron (MLP) with batch normalization is then used as the classifier, and the training loss is the squared multi-label margin loss.

However, experiments showed that overfitting is severe in the architecture described so far. Hence, we introduce the auxiliary connection, which accelerates convergence and prevents overfitting. The auxiliary connection is placed after the video LSTM and the audio LSTM; the data are then summed and mapped to the target space. During training, three terms are taken into account: the main loss, the video auxiliary loss and the audio auxiliary loss.

The implication is that the audio-visual loss, the video loss and the audio loss are minimized together, so the auxiliary networks help the main network achieve a win-win situation. Like ResNet, the auxiliary networks convey information from the original networks, which lets am-LSTM exploit rich, hierarchical information.

Figure 3: Batch vs. loss. Without the auxiliary connection the loss plateaus at about 0.2 because of strong overfitting.

Figure 4: The details of the auxiliary connection.
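A possible PyTorch reading of this architecture is sketched below: a video LSTM and an audio LSTM, the projection and scaled tanh of Eqs. (13)-(14), a batch-normalized MLP as the main classifier, and two auxiliary heads. The input and state sizes follow Section 4.3 (100-dimensional features, 50-dimensional LSTM states, 10 classes), but the MLP widths, the summation over frames and the exact form of the auxiliary heads are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class AmLSTM(nn.Module):
    """A minimal sketch of the am-LSTM forward pass (dropout omitted for brevity)."""
    def __init__(self, in_dim=100, hid_dim=50, n_classes=10):
        super().__init__()
        self.video_lstm = nn.LSTM(in_dim, hid_dim, batch_first=True)
        self.audio_lstm = nn.LSTM(in_dim, hid_dim, batch_first=True)
        self.proj_v = nn.Linear(hid_dim, hid_dim, bias=False)   # P^v in Eq. (13)
        self.proj_a = nn.Linear(hid_dim, hid_dim, bias=False)   # P^a in Eq. (13)
        self.classifier = nn.Sequential(                        # main MLP with batch norm
            nn.Linear(hid_dim, hid_dim), nn.BatchNorm1d(hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, hid_dim), nn.BatchNorm1d(hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, n_classes))
        self.aux_v = nn.Linear(hid_dim, n_classes)              # video auxiliary head
        self.aux_a = nn.Linear(hid_dim, n_classes)              # audio auxiliary head

    def forward(self, video, audio):
        # video, audio: (batch, time, in_dim) frame features
        hv, _ = self.video_lstm(video)                          # Eqs. (7)-(12), video modality
        ha, _ = self.audio_lstm(audio)                          # analogous equations, audio modality
        f = torch.tanh(2.0 / 3.0 * (self.proj_v(hv) + self.proj_a(ha)))  # Eqs. (13)-(14)
        fused = f.sum(dim=1)                                    # sum the influence of all frames
        main_out = self.classifier(fused)                       # main audio-visual prediction
        aux_v_out = self.aux_v(hv.sum(dim=1))                   # auxiliary connection after video LSTM
        aux_a_out = self.aux_a(ha.sum(dim=1))                   # auxiliary connection after audio LSTM
        return main_out, aux_v_out, aux_a_out
```

Under this reading, the remark in Section 4.3 that the projection reduces to an identity matrix corresponds to freezing proj_v and proj_a as identity maps.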
3.3 Training am-LSTM

To train the am-LSTM, we build on the squared multi-label margin loss

loss(x, y) = (1/n) Σ_{i=1}^{n} max{0, (1 − x_i + y_i)^2}   (15)

where loss(x, y) is the squared multi-label margin loss and x and y are the prediction and the target. The full training objective adds the two auxiliary terms:

Loss = loss_main + α · loss^video_aux + β · loss^audio_aux
     = (1/n) Σ_{i=1}^{n} [ max{0, (1 − x_{main,i} + y_i)^2} + α max{0, (1 − x^v_{aux,i} + y_i)^2} + β max{0, (1 − x^a_{aux,i} + y_i)^2} ]   (16)

where Loss is the actual training loss of am-LSTM, loss_main is the loss of the main branch, and loss^video_aux and loss^audio_aux are the losses of the auxiliary branches. α and β are hyperparameters controlling the influence of the auxiliary branches; here we choose α = β = 0.2, and y is the classification target. The auxiliary connection is loosely constrained, which keeps the am-LSTM flexible and extensible.
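A minimal PyTorch sketch of Eqs. (15)-(16) follows, using the three outputs of the model sketch above. The margin term is read elementwise and y is assumed to be a target tensor of the same shape as the scores, since the exact index convention is not spelled out in the paper.

```python
import torch

def squared_margin_loss(x, y):
    """Squared multi-label margin loss of Eq. (15) as printed: a mean over
    max{0, (1 - x_i + y_i)^2} terms. The clamp mirrors the max{0, .} literally;
    because the squared term is already non-negative it acts only as a safeguard."""
    return torch.clamp((1.0 - x + y) ** 2, min=0.0).mean()

def am_lstm_loss(main_out, aux_v_out, aux_a_out, y, alpha=0.2, beta=0.2):
    """Combined training loss of Eq. (16): the main audio-visual loss plus the
    weighted video and audio auxiliary losses, with alpha = beta = 0.2."""
    return (squared_margin_loss(main_out, y)
            + alpha * squared_margin_loss(aux_v_out, y)
            + beta * squared_margin_loss(aux_a_out, y))
```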
4 Experiments

In this section we describe our experiments, including the datasets, data pre-processing, implementation details and results, which indicate that the am-LSTM is a robust, well-performing and flexible model for AVSR and likewise for lipreading.

4.1 Datasets

We conducted our experiments on three datasets: AVLetters (Patterson et al., 2002), AVLetters2 (Cox et al., 2008) and AVDigits (Di Hu et al., 2015).

Dataset      Speakers   Content   Repetitions   Miscellaneous
AVLetters    10         A-Z       3             pre-extracted features
AVLetters2   5          A-Z       7             previous split
AVDigits     6          0-9       9             /

Table 1: Details of the datasets.

4.2 Data Pre-processing

If the video and audio have not been split beforehand, we split each recording into a video and an audio stream and bring them to the same length by truncation or completion. Finally, the data are centered.

Video pre-processing. First, the Viola-Jones algorithm [28] is employed to extract the region of interest surrounding the mouth. After the region is resized to 224 × 224 pixels, a pre-trained VGG-16 [22] is used to extract image features; we use the features of the last fully connected layer and finally reduce them to 100 principal components with PCA whitening.

Audio pre-processing. The audio signal is converted to a spectrogram with a 20 ms Hamming window and 10 ms overlap. With 251 points of the Fast Fourier Transform and 50 principal components from PCA, the resulting spectral coefficient vector is used as the audio features.

Data augmentation. Four contiguous audio frames correspond to one video frame at each time step. We shift the aural and visual features of every video simultaneously by up to 10 frames, up or down, at random, which doubles the data. As a result, the model generalizes better in the time domain.
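The feature pipeline of this section might look roughly as follows in Python. The sampling rate, the dataset-level PCA fit, the choice of VGG-16 layer, the mouth-ROI detector configuration and the omission of ImageNet normalization are all assumptions filling in details the paper leaves out.

```python
import scipy.signal
import torch
import torchvision
from sklearn.decomposition import PCA   # e.g. PCA(100, whiten=True).fit(all_video_feats)

def video_features(mouth_rois, pca):
    """mouth_rois: (T, 3, 224, 224) float tensor of mouth regions located with a
    Viola-Jones style detector and resized to 224x224 (ImageNet normalization omitted).
    pca: a PCA(n_components=100, whiten=True) already fitted on the whole dataset."""
    vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").eval()
    # Keep everything up to the last fully connected layer (assumed to mean fc7 here).
    vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
    with torch.no_grad():
        feats = vgg(mouth_rois).numpy()          # (T, 4096)
    return pca.transform(feats)                  # (T, 100)

def audio_features(waveform, pca, sample_rate=8000):
    """Spectrogram with a 20 ms Hamming window, 10 ms overlap and a 251-point FFT.
    The sampling rate is not stated in the paper; 8 kHz is assumed so the 20 ms
    window fits inside the 251-point FFT. pca: PCA(50, whiten=True), dataset-fitted."""
    nperseg = int(0.020 * sample_rate)           # 20 ms analysis window
    noverlap = int(0.010 * sample_rate)          # 10 ms overlap
    _, _, spec = scipy.signal.spectrogram(
        waveform, fs=sample_rate, window="hamming",
        nperseg=nperseg, noverlap=noverlap, nfft=251)
    return pca.transform(spec.T)                 # (frames, 50)
```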
4.3 Implementation Details

The am-LSTM has two standard LSTMs with dropout, mapping the 100-dimensional inputs to 50-dimensional states. Since the video and audio features then have the same dimension, the projection is an identity matrix and can be omitted. The MLP in the main branch has three layers with batch normalization and ReLU activation functions. The auxiliary branch has a simple structure that maps the 50-dimensional data to the 10 classes prepared for classification. As mentioned, α = β = 0.2, so the training loss is

Loss = (1/n) Σ_{i=1}^{n} [ max{0, (1 − x_{main,i} + y_i)^2} + 0.2 max{0, (1 − x^v_{aux,i} + y_i)^2} + 0.2 max{0, (1 − x^a_{aux,i} + y_i)^2} ]   (17)

The am-LSTM can be trained and tested in a single pass; no redundant training processes are needed. Lipreading is treated as cross-modality speech recognition: in the training phase both the video and audio modalities are needed, whereas in the testing phase only the video modality is presented.

4.4 Results

The evaluation of am-LSTM is split into two parts: the AVSR task and cross-modality lipreading. AVSR is trained and evaluated on both modalities, whereas cross-modality lipreading is trained on both modalities but evaluated on video only. Our AVSR experiments are conducted on AVLetters2 and AVDigits, and cross-modality lipreading is conducted on AVLetters.

4.4.1 Audio-visual Speech Recognition

AVSR is the main task of our work. We conducted experiments on AVLetters2 and AVDigits; the comparison methods are MDBN, MDAEs and RTMRBM. The results indicate that am-LSTM is much better than these models.

Dataset      Model                          Mean accuracy
AVLetters2   MDBN [23]                      54.10%
             MDAEs [17]                     67.89%
             RTMRBM (Di Hu et al., 2015)    74.77%
             am-LSTM                        89.11%
AVDigits     MDBN [23]                      55.00%
             MDAEs [17]                     66.74%
             RTMRBM (Di Hu et al., 2015)    71.77%
             am-LSTM                        85.23%

Table 2: AVSR performance on AVLetters2 and AVDigits. The results indicate that our model performs better than MDBN, MDAEs and RTMRBM.

4.4.2 Cross-Modality Lipreading

Cross-modality lipreading is the secondary task. As mentioned before, we trained am-LSTM on both modalities but evaluated it on the visual modality only. The experiments have two modes, video only and cross modality, which show the superiority of cross-modality lipreading. In the cross-modality experiments, am-LSTM performs much better than MDAEs, CRBM and RTMRBM. The lipreading here is word-level lipreading.

Mode             Model                              Mean accuracy
Only video       Multiscale Spatial Analysis [13]   44.60%
                 Local Binary Pattern [29]          58.85%
Cross modality   MDAEs [17]                         64.21%
                 CRBM [2]                           65.78%
                 RTMRBM (Di Hu et al., 2015)        66.21%
                 am-LSTM                            88.83%

Table 3: Cross-modality lipreading performance. The experiments suggest that cross-modality lipreading is better than single-modality lipreading, and that our method performs much better than the other models.

5 Conclusion

We proposed an end-to-end deep model for AVSR and lipreading that clearly increases mean accuracy. Our experiments suggest that am-LSTM performs much better than other models in AVSR and cross-modality lipreading. The benefits of am-LSTM are single-pass training and testing, extensibility and flexibility; no additional training processes are needed. Because am-LSTM is simple, it can serve as a tool for fusing information efficiently; at the same time, it considers temporal connections, so it is suitable for sequential features. In the future, we plan to apply am-LSTM to other multimodal temporal tasks and to make it more flexible.

Acknowledgments

I would like to acknowledge my friends Yi Sun, Wenqiang Yang and Peng Xie for their helpful support. I would like to thank the developers of Torch [6] and TensorFlow [1]. I would also like to thank my laboratory for providing computational resources.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] Mohamed R. Amer, Behjat Siddiquie, Saad Khan, Ajay Divakaran, and Harpreet Sawhney. Multimodal fusion using dynamic hybrid models. pages 556–563, 2014.
[3] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. arXiv preprint arXiv:1512.02595, 2015.
[4] Pradeep K. Atrey, M. Anwar Hossain, Abdulmotaleb El Saddik, and Mohan S. Kankanhalli. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems, 16(6):345–379, 2010.
[5] Yoshua Bengio, Ian J. Goodfellow, and Aaron Courville. Deep Learning. An MIT Press book in preparation. Draft chapters available at http://www.iro.umontreal.ca/~bengioy/dlbook, 2015.
[6] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
[7] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376. ACM, 2006.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[9] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[11] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632, 2014.
[12] Petros Maragos, Alexandros Potamianos, and Patrick Gros. Multimodal Processing and Interaction: Audio, Video, Text, volume 33. Springer Science & Business Media, 2008.
[13] Iain Matthews, Timothy F. Cootes, J. Andrew Bangham, Stephen Cox, and Richard Harvey. Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):198–213, 2002.
[14] Youssef Mroueh, Etienne Marcheret, and Vaibhava Goel. Deep multimodal learning for audio-visual speech recognition. pages 2130–2134, 2015.
[15] Audrey R. Nath and Michael S. Beauchamp. A neural basis for interindividual differences in the McGurk effect, a multisensory speech illusion. NeuroImage, 59(1):781–787, 2012.
[16] Ara V. Nefian, Luhong Liang, Xiaobo Pi, Xiaoxing Liu, and Kevin Murphy. Dynamic Bayesian networks for audio-visual speech recognition. EURASIP Journal on Advances in Signal Processing, 2002(11):1–15, 2002.
[17] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 689–696, 2011.
[18] Kuniaki Noda, Yuki Yamaguchi, Kazuhiro Nakadai, Hiroshi G. Okuno, and Tetsuya Ogata. Audio-visual speech recognition using deep learning. Applied Intelligence, 42(4):722–737, 2015.
[19] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. ICML (3), 28:1310–1318, 2013.
[20] Lutz Prechelt. Automatic early stopping using cross validation: quantifying the criteria. Neural Networks, 11(4):761–767, 1998.
[21] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[22] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[23] Nitish Srivastava and Ruslan Salakhutdinov. Learning representations for multimodal data with deep belief nets. In International Conference on Machine Learning Workshop, 2012.
[24] Nitish Srivastava and Ruslan R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems, pages 2222–2230, 2012.
[25] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[26] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[27] Satoshi Tamura, Hiroshi Ninomiya, Norihide Kitaoka, Shin Osuga, Yurie Iribe, Kazuya Takeda, and Satoru Hayamizu. Audio-visual speech recognition using deep bottleneck features and high-performance lipreading. In 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pages 575–582. IEEE, 2015.
[28] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I-511. IEEE, 2001.
[29] Guoying Zhao, Mark Barnard, and Matti Pietikäinen. Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia, 11(7):1254–1265, 2009.
