Acceptedforpublication,ICASSP2017 END-TO-END VISUAL SPEECH RECOGNITIONWITHLSTMS StavrosPetridis,Zuwei Li Maja Pantic ImperialCollegeLondon Imperial CollegeLondon/ Univ. ofTwente Dept. ofComputing,London,UK Dept. ofComputing,UK /EEMCS, Netherlands {sp104;zl4615}@imperial.ac.uk [email protected] 7 1 0 2 ABSTRACT a dynamic classifier, like Hidden Markov Models (HMMs) or Long-Short Term Memory (LSTM) recurrent neural net- n Traditional visual speech recognition systems consist of a two stages, feature extraction and classification. Recently, works,isusedtomodelthetemporalevolutionofthefeatures. J severaldeeplearningapproacheshavebeenpresentedwhich Recently, several deep learning approaches for visual 0 automaticallyextractfeaturesfromthemouthimagesandaim speech recognition have been presented. The vast major- 2 toreplacethefeatureextractionstage. However,researchon ity also follow a two stage approach where deep bottleneck ] jointlearningoffeaturesandclassificationisverylimited. In architectures are used for feature extraction. First, high di- V thiswork,wepresentanend-to-endvisualspeechrecognition mensionalfeaturesare extractedfromthe mouthROI which C systembasedonLong-ShortMemory(LSTM)networks. To are compressed to a low dimensional representation at the . thebestofourknowledge,thisisthefirstmodelwhichsimul- bottlenecklayerofadeepnetworkandthenfedtoaclassifier. s c taneously learns to extract features directly from the pixels Ngiametal. [4]appliedprincipalcomponentanalysis(PCA) [ and perform classification and also achieves state-of-the-art to the mouth ROI and trained a deep autoencoder to extract 1 performanceinvisualspeechclassification. Themodelcon- bottleneck features. The features from the entire utterance v sists of two streams which extractfeaturesdirectly from the were fed to a support vector machine ignoring the temporal 7 mouthanddifferenceimages,respectively. Thetemporaldy- dynamics of the speech. Ninomiya et al. [5] also applied 4 namicsineachstreamaremodelledbyanLSTMandthefu- PCA to the mouth ROIs and used a deep autoencoder to 8 sionofthetwostreamstakesplaceviaaBidirectionalLSTM extract bottleneck features but an HMM was used in order 5 0 (BLSTM). An absolute improvementof 9.7% over the base to take into account the temporal dynamics. Sui et al. [6] . line is reported on the OuluVS2 database, and 1.5% on the extractedlocalbinarypatternsfromthemouthROIandused 1 CUAVEdatabasewhencomparedwithothermethodswhich adeepautoencodertoreducetheirdimensionality. Then,the 0 7 useasimilarvisualfront-end. bottleneckfeatureswereconcatenatedwithDCTfeaturesand 1 fedtoanHMM.Asimilarapproachhasalsobeenfollowedin Index Terms— Visual Speech Recognition, Lipreading, : audiovisualspeech recognition[7, 8, 9] where a shared rep- v End-to-End Training, Long-Short Term Recurrent Neural i resentationoftheinputaudioandvisualfeaturesisextracted X Networks,DeepNetworks fromthebottlenecklayer. r a Fewworkshavealsobeenpresentedwhichextractbottle- 1. INTRODUCTION neckfeaturesdirectlyfromthepixels. Li[10]usedaconvo- lutionalneuralnetwork(CNN)inordertoextractbottleneck Speech is an audiovisual signal which consists of the audio featuresfrom dynamicrepresentationsof images, which are vocalisationandthecorrespondingmouthconfiguration. Al- fedtoanHMMforclassification. Inourpreviouswork[11], thoughmostoftheinformationiscarriedbytheaudiosignal, weextractedbottleneckfeaturesdirectlyfromtherawmouth the visual signal also carries complementary and redundant ROI using a deep feedforward network and then trained an information. This visual information, which is not affected LSTMforclassification. Nodaetal. [12]usedaCNNtopre- byacousticnoise,cansignificantlyimprovetheperformance dictthephonemethatcorrespondstoaninputmouthROI,and ofspeechrecognitioninnoisyenvironments. thenanHMMisusedtogetherwithaudiofeaturesinorderto Traditionally, visual speech recognition systems consist classifyanutterance. oftwostages,featureextractionfromthemouthregionofin- terest (ROI) and classification [1, 2, 3]. The most common Despite the success of deep learning methods in feature featureextractionapproachistheuseofadimensionalityre- extraction,workonend-to-endvisualspeechrecognitionhas duction/compressionmethod,withthemostpopularbeingthe beenverylimited. Tothebestofourknowledge,onlyWand Discrete Cosine Transform (DCT), which results in a com- et al. [13] developed an end-to-end system for lipreading. pact representation of the mouth ROI. In the second stage, The system consists of one feedforward layer followed by Acceptedforpublication,ICASSP2017 Softmax Target Classes BLSTM (250) LSTM (250) LSTM (250) 50 + 50 + Fig.2:ExampleofmouthROIextractionfromCUAVE 500 500 database,and1.5%ontheCUAVEdatabasewhencompared Encoding withothermethodswhichuseasimilarvisualfront-end. Layers 1000 1000 2. DATABASES 2000 2000 The databases used in this study are the OuluVS2 [16] and CUAVE[17]. TheOuluVS2contains52speakerssaying10 utterances, 3 times each, so in total there are 156 examples per utterance. The utterances are the following: “Excuse me”, “Goodbye”, “Hello”, “How are you”, “Nice to meet Image Diff Image you”, “See you”, “I am sorry”, “Thankyou”, “Have a good time”,“Youarewelcome”.ThemouthROIsareprovidedand Fig.1: Overviewoftheend-to-endvisualspeechrecognition they are downscaled to 26 by 44 in order to keep the aspect system. Two streamsare usedfor featureextractiondirectly ratioconstant. fromtherawimages. Thefirststreamextractsfeaturesfrom TheCUAVEdatasetcontains36subjectsspeakingdigits0 therawmouthROIandthesecondstreamfromthediffmouth to9,5timeseach,sointotalthereare180examplesperdigit. ROIinordertocapturelocaltemporaldynamics. The∆and Thenormalportionofthedatabaseisusedwherethesubjects ∆∆ features are also computedand appendedto the bottle- are in frontalposition. Sixty eight pointsare tracked on the necklayer. Theencodinglayersarepre-trainedusingRBMs. face using the tracker proposed in [18]. The faces are first The temporal dynamics are modelled by an LSTM in each alignedusinganeutralreferenceframeinordertonormalise stream. ABLSTMisusedtofusetheinformationfromboth them for rotation and size differences. This is done using streamsandprovidesalabelforeachinputframe. anaffinetransformusing5stablepoints,twoeyescornersin eacheyeandthetipofthenose.Thenthecenterofthemouth islocatedbasedonthetrackedpointsandaboundingboxwith two LSTM layers and trained to perform lipreadingdirectly size90by150isusedtoextractthemouthROIasshownin fromraw mouth ROIs. The system was tested on a subject- Fig. 2. Finally,themouthROIsaredownscaledto30by50. dependentexperimentontheGRIDcorpus[14]andalthough itoutperformedotherbaselinesystemsitfailedtooutperform 3. END-TO-ENDVISUALSPEECHRECOGNITION thestate-of-the-artresults[15]. In this paper, we present an end-to-end visual speech Theproposeddeeplearningsystemforvisualspeechrecogni- recognitionsystemwhichjointlylearnsthefeatureextraction tionisshowninFig.1.Itconsistsoftwoindependentstreams andclassification stages. To the best of our knowledge,this which extractfeaturesdirectly fromthe raw input. The first is the first end-to-end model which performs visual speech stream mainly encodes static information by extracting fea- recognitionfromrawmouthROIsand achievesstate-of-the- tures directly from the raw mouth ROI. The second stream art performance. The system consists of two streams, one encodes the local temporal dynamics by extracting features which encodes static information and one which encodes from the diff mouth ROI, which is computed by taking the local temporal dynamics. The former operates on the raw differencebetweentwoconsecutiveframes. mouthROIsandthelatteronthedifference(diff)images. An Both streamsfollowa bottleneckarchitecturein orderto LSTMmodelsthetemporaldynamicsineachstreamandthe compressthe highdimensionalinputimageto a low dimen- fusionofbothstreamsoccursthroughaBLSTM. sional representation at the bottleneck layer. The same ar- Weperformsubjectindependentexperimentsontwodif- chitecture as in [19] is used, where 3 sigmoid hidden layers ferentdatasets,OuluVS2andCUAVE.Anabsoluteimprove- areusedwith sizesof2000,1000and500,respectively,fol- ment of 9.7% over the baseline is reported on the OuluVS2 lowedbyalinearbottlenecklayer. Theseencodinglayersare Acceptedforpublication,ICASSP2017 Table 1: Classification Accuracyon the OuluVS2 database. Table 2: Classification Accuracy on the CUAVE database. Theend-to-endmodelsareevaluatedusingtheprotocolsug- The end-to-endmodel is trained using the same protocol as gestedin[22]where40subjectsareusedfortrainingandval- [4,23]where18subjectsareusedfortrainingandvalidation idationand12subjectsare usedfortesting. † Thesemodels and18fortesting. ∗Thismodelistrainedon28subjectsand usealeave-one-subject-outcrossvalidationforevaluation. tested on 8 subjects. † These models are trained and tested using a 6-fold cross validation. ‡ This model uses a visual Method Classification front-endwhichissignificantlymorecomplicatedthanours. Accuracy Method Classification End-to-End(RawImage) 78.0 Accuracy End-to-End(DiffImage) 75.8 End-to-End(RawImage) 71.4 End-to-End(Raw+DiffImages,Fig. 1) 84.5 End-to-End(DiffImage) 65.9 DCT+HMM[22]† 74.8 End-to-End(Raw+DiffImages,Fig. 1) 78.6 LatentVariableModels[22]† 73.0 DeepAutoencoder+SVM[4] 68.7 DeepBoltzmannMachines+SVM[23] 69.0 pre-trained in a greedy layer-wise manner using Restricted AAM+HMM[24]† 75.7 BoltzmannMachines(RBMs)[20]. The∆(firstderivatives) and∆∆(secondderivatives)[21]featuresarealsocomputed, Patch-basedFeatures+HMM[25]∗ 77.1 basedonthebottleneckfeatures,andtheyareappendedtothe VisemicAAM+HMM[26]†‡ 83.0 bottlenecklayer. Inthisway,duringtrainingweforcetheen- codinglayerstolearnrepresentationswhichproducegood∆ and∆∆features. 4.2. Preprocessing Finally, an LSTM layer is added on top of the encoding layers in order to model the temporal dynamics of the fea- Since all the experiments are subject independent we first turesin each stream. TheLSTM outputsof each stream are need to reduce the impact of subject dependent characteris- concatenatedandfedtoaBLSTMinordertofusetheinfor- tics. Thisis done bysubtractingthe meanimage, computed mationfrombothstreams. Theoutputlayerisasoftmaxlayer overtheentireutterance,fromeachframe. whichprovidesalabelforeachinputframe. Theentiresys- The next step is the normalisation of data. As recom- temistrainedend-to-endwhichenablesthejointlearningof mendedin[20]thedatashouldbez-normalised,i.e. themean features and classifier. In other words, the encoding layers andstandarddeviationshouldbeequalto0and1respectively, learntoextractfeaturesfromrawimageswhichareusefulfor beforetraininganRBMwithlinearinputunits. Hence,each classificationusingLSTMs. imageisz-normalisedbeforepre-trainingtheencodinglayers. 4. EXPERIMENTALSETUP 4.3. Training 4.1. EvaluationProtocol RBM Training: A Gaussian-Bernoulli RBM [20] is used for the first layer since the input (pixels) is real-valued, fol- Wefirstpartitionthedataintotrainingandtestsets. Thepro- lowedbytwoBernoulli-BernoulliRBMsandoneBernoulli- tocol suggested by the creators of the OuluVS2 database is Gaussian RBM for the linear bottleneck layer. Each RBM used [22] where 40 subjects are used for training and vali- is trained for 20 epochs with a mini-batch size of 100 and dation and 12 for testing. We randomlydivided the 40 sub- L2 regularisation coefficient of 0.0002 using contrastive di- jectsinto 30and 10subjectsfor trainingand validationpur- vergence. Thelearningrateisfixedto 0.1fortheBernoulli- poses, respectively. This means that there are 900 training BernoulliRBMs and to 0.001when one layer (inputor bot- utterances,300validationutterancesand360testutterances. tleneck)isreal-valuedassuggestedin[20]. Theevaluationprotocolsuggestedin[4]wasusedforex- End-to-EndTraining: TheAdaDeltaalgorithm[27],which perimentsontheCUAVEdatabase. Theodd-numberedsub- automaticallycomputesthelearningrateineachepoch,was jects(18intotal)areusedfortestingandtheeven-numbered used for training with a mini-batch size of 20 utterances. subjects are used for training. We further divided the latter Earlystoppingwithadelayof5epochswasalsousedinor- ones into 12 subjects for training and 6 for validation. This dertoavoidoverfitting. Gradientclippingwasappliedtothe meansthatthereare600,300and900training,validationand LSTM layers. The label of the last frame in each utterance testutterances,respectively. wasusedinordertolabeltheentireutterance. Acceptedforpublication,ICASSP2017 80 0 1 1 70 2 30 2 60 3 25 3 50 4 20 4 5 40 5 6 15 6 30 7 10 7 20 8 8 10 9 5 9 10 0 0 0 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 10 Fig. 3: CUAVE confusion matrix. The labels for X and Y Fig. 4: OuluVS2 confusionmatrix. The labels for X and Y axescorrespondtodigits0to9. axescorrespondtothe10phrasesdescribedinsection2. 5. RESULTS datasets. In the OuluVS2 dataset, the most confusionswere between phrases 3 (Hello) and 8 (Thank you) and between ResultsfortheOuluVS2databaseareshowninTable1. Since phrases6(Seeyou)and9(Haveagoodtime).IntheCUAVE thisdatabasehasbeenreleasedrecentlyonlythebaselinere- dataset, number pairs zero and two, six and nine were most sultsprovidedbythecreatorsareavailable.Thebestprovided frequentlyconfused. Zero and two share similar viseme se- baselineresult,74.8%,isachievedbyHMMsincombination quencesneartheendoftheutterancewhilesixandnineshare with DCT features. We first test each stream of the end- similar viseme sequences at the start of the utterance which to-end model individually, i.e., just the encoding layers and explains the more frequently occurring confusions for these the LSTM layer are considered. Itis interesting to note that numberpairs. bothstreamsoutperformthebaselineperformance. Thebest Finally, we should also mention that we experimented overallresultis achievedby the end-to-end2-stream model, with convolutional neural networks for the encoding layers showninFig. 1,withaclassificationaccuracyof84.5%. We butthisledto worseperformancethanthe proposedsystem. shouldalsoemphasisethatthebaselineperformanceisevalu- This is also reportedin [13] and it is likely due to the small ated usinga leave-one-subject-outcross validationapproach training sets. We also used data augmentation which im- whichmeansthereare51subjectsfortrainingandvalidation provedthe performancebut did not exceed the performance andonlyonesubjectfortestingineachiteration.Ontheother oftheproposedsystem. hand,weusemuchfewersubjectsfortrainingandvalidation, 40,andmanymoresubjectsfortesting,12,whichmakesthe problemmorechallenging. Eveninthiscase,theend-to-end 6. CONCLUSION systemresultsinasignificantimprovementoverthebaseline performance. Results for the CUAVE database are shown in Table 2. In this work, we present an end-to-endvisual speech recog- There is not a standard evaluationprotocolfor this database nitionsystemwhichjointlylearnstoextractfeaturesdirectly which makes comparison between different works difficult. from the pixels and perform classification using LSTM net- Only[4]and[23]usethesameevaluationprotocolasinthis works. Results on subject independentexperimentsdemon- study. Weseethatthesingle-streamend-to-endmodelbased strate that the proposed model achieves state-of-the-art per- on raw mouth ROIs outperforms both previous works. The formanceontheOuluVS2andCUAVEdatabaseswhencom- 2-stream end-to-end model outperforms all approaches that paredwithmodelswhichuseasimilarvisualfrontend. The useasimilarvisualfront-end. Thisincludes[24]wherea6- model can be easily extended to multiple streams so we fold cross validation was used with 30 subjects for training are planning to add an audio stream in order to evaluate its and validation and 6 for testing, and [25] where 28 subjects performanceonaudiovisualspeechrecognitiontasks. were used for training and validation and 10 for testing. In this study, we use much fewer subjects, 18, for training and validationandmanymoresubjectsfortesting,18. Only[26] 7. ACKNOWLEDGEMENTS achieves a higher performance than our end-to-end system, butamuchmorecomplicatedvisualfront-endisused,witha cascadeofactiveappearancemodels(AAM),andthemodel This work has been funded by the European Community isevaluatedusinga6-foldcross-validation. Horizon 2020 under grant agreement no. 645094 (SEWA) Figures 3 and 4 show the confusion matrices for both andno. 688835(DE-ENIGMA). Acceptedforpublication,ICASSP2017 8. REFERENCES [14] M.Cooke,J.Barker,S.Cunningham,andX.Shao, “An audio-visualcorpusforspeechperceptionandautomatic [1] G.Potamianos,C.Neti,G.Gravier,A.Garg,andA.W. speechrecognition,” TheJournaloftheAcousticalSo- Senior, “Recentadvancesin the automaticrecognition cietyofAmerica,vol.120,no.5,pp.2421–2424,2006. of audiovisual speech,” Proceedings of the IEEE, vol. [15] A. H. Abdelaziz, S. Zeiler, and D. Kolossa, “Learning 91,no.9,pp.1306–1326,Sept2003. dynamicstreamweightsforcoupled-hmm-basedaudio- [2] S.DupontandJ.Luettin, “Audio-visualspeechmodel- visualspeechrecognition,” IEEETrans.onAudio,Sp., ingforcontinuousspeechrecognition,” IEEETrans.on andLang.Proc.,vol.23,no.5,pp.863–876,May2015. Multimedia,vol.2,no.3,pp.141–151,Sep2000. [16] I. Anina, Z. Zhou, G. Zhao, and M. Pietikinen, [3] Z.Zhou,G.Zhao,andM.Pietikinen, “Towardsaprac- “Ouluvs2: A multi-view audiovisualdatabase for non- ticallipreadingsystem,” inIEEECVPR,2011,pp.137– rigidmouthmotionanalysis,” inIEEEFG,2015. 144. [17] E. Patterson, S. Gurbuz, Z. Tufekci, and J. Gowdy, “Moving-talker,speaker-independentfeaturestudy,and [4] J.Ngiam,A.Khosla,M.Kim,J.Nam,H.Lee,andA.Y baselineresultsusingthecuavemultimodalspeechcor- Ng, “Multimodal deep learning,” in Proc. of ICML, pus,” EURASIPJ.Appl.SignalProcess.,vol.2002,no. 2011,pp.689–696. 1,pp.1189–1201,Jan.2002. [5] H. Ninomiya, N. Kitaoka, S. Tamura, Y. Iribe, and [18] V.KazemiandJ.Sullivan, “Onemillisecondfacealign- K.Takeda, “Integrationofdeepbottleneckfeaturesfor ment with an ensemble of regression trees,” in IEEE audio-visualspeechrecognition,” inConf.oftheInter- CVPR,2014,pp.1867–1874. nationalSpeechCommunicationAssociation,2015. [19] G.HintonandSalakhutdinovR., “Reducingthedimen- [6] C. Sui, R. Togneri, and M. Bennamoun, “Extracting sionality of data with neural networks,” Science, vol. deepbottleneckfeaturesforvisualspeechrecognition,” 313,no.5786,pp.504–507,2006. inICASSP,2015,pp.1518–1522. [20] G.Hinton,“Apracticalguidetotrainingrestrictedboltz- [7] J.HuangandB.Kingsbury,“Audio-visualdeeplearning mann machines,” in Neural Networks: Tricks of the fornoiserobustspeechrecognition,” inIEEEICASSP, Trade,pp.599–619.Springer,2012. 2013,pp.7596–7599. [21] S.Young,G.Evermann,M.Gales,T.Hain,D.Kershaw, [8] Y. Mroueh, E. Marcheret, and V. Goel, “Deep multi- X.Liu,G.Moore,J.Odell,D.Ollason,D.Povey,etal., modallearningforaudio-visualspeechrecognition,” in “Thehtkbook,” vol.3,pp.175,2002. IEEEICASSP,2015,pp.2130–2134. [22] ,”http://ouluvs2.cse.oulu.fi. [9] C.Sui,M.Bennamoun,andR.Togneri,“Listeningwith [23] N.SrivastavaandR.Salakhutdinov,“Multimodallearn- your eyes: Towards a practical visual speech recogni- ing with deep boltzmann machines,” J. Mach. Learn. tion systemusingdeepboltzmannmachines,” inIEEE Res.,vol.15,no.1,pp.2949–2980,Jan.2014. ICCV,2015,pp.154–162. [24] G.Papandreouetal., “Multimodalfusionandlearning [10] Y. Li, Y. Takashima, T. Takiguchi, and Y. Ariki, “Lip with uncertain features applied to audiovisual speech readingusingadynamicfeatureoflipimagesandcon- recognition,” in Workshop on Multimedia Signal Pro- volutional neural networks,” in IEEE/ACIS Intl. Conf. cessing,2007,pp.264–267. onComputerandInformationScience,2016,pp.1–6. [25] P.LuceyandS.Sridharan, “Patch-basedrepresentation [11] S. Petridis and M. Pantic, “Deep complementary bot- ofvisualspeech,” inProc.oftheHCSNetWorkshopon tleneckfeaturesforvisualspeechrecognition,” inIEEE UseofVisioninHCI,2006,pp.79–85. ICASSP,2016,pp.2304–2308. [26] G. Papandreou, A. Katsamanis, V. Pitsikalis, and [12] K.Noda,Y.Yamaguchi,K.Nakadai,H.G.Okuno,and P.Maragos,“Adaptivemultimodalfusionbyuncertainty T.Ogata, “Audio-visualspeechrecognitionusingdeep compensation with application to audiovisual speech learning,” AppliedIntelligence,vol.42,no.4,pp.722– recognition,” IEEETrans. onAudio,Speech,andLan- 737,2015. guageProcessing,vol.17,no.3,pp.423–435,2009. [13] M. Wand, J. Koutn, and J. Schmidhuber, “Lipreading [27] M. D. Zeiler, “ADADELTA: an adaptive learning rate withlongshort-termmemory,” inIEEEICASSP,2016, method,” vol.http://arxiv.org/abs/1212.5701,2012. pp.6115–6119.