
Vid2speech: Speech Reconstruction from Silent Video



Ariel Ephrat and Shmuel Peleg
The Hebrew University of Jerusalem
Jerusalem, Israel

arXiv:1701.00495v2 [cs.CV] 9 Jan 2017

ABSTRACT

Speechreading is a notoriously difficult task for humans to perform. In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible acoustic speech signal from silent video frames of a speaking person. The proposed CNN generates sound features for each frame based on its neighboring frames. Waveforms are then synthesized from the learned speech features to produce intelligible speech. We show that by leveraging the automatic feature learning capabilities of a CNN, we can obtain state-of-the-art word intelligibility on the GRID dataset, and show promising results for learning out-of-vocabulary (OOV) words.

Index Terms: Speechreading, visual speech processing, articulatory-to-acoustic mapping, speech intelligibility, neural networks

Fig. 1. Our CNN-based model takes the frames of silent video as input, and predicts sound features which are converted into intelligible speech. Sound features are calculated by performing 8th-order LPC analysis and LSP decomposition on half-overlapping audio frames of 40 ms each. Concatenating every two successive LSP vectors results in a feature vector S_i ∈ R^18.

1. INTRODUCTION

Speechreading is the task of obtaining reliable phonetic information from a speaker’s face during speech perception. It has been described as “trying to grasp with one sense information meant for another”. Given that several phonemes (phonetic units of speech) often correspond to a single viseme (visual unit of speech), it is a notoriously difficult task for humans to perform.

Several applications come to mind for automatic video-to-speech systems: enabling videoconferencing from within a noisy environment; facilitating conversation at a party with loud music between people having wearable cameras and earpieces; maybe even using surveillance video as a long-range listening device.

Much work has been done in the area of automating speechreading by computers [1, 2, 3]. There are two main approaches to this task. The first, and the one most widely attempted in the past, consists of modeling speechreading as a classification problem. In this approach, the input video is manually segmented into short clips which contain either whole words from a predefined dictionary, or parts of words comprising phonemes or visemes [4]. Then, visual features are extracted from the frames and fed to a classifier. Wand et al. [5], Assael et al. [6] and Chung et al. [7] have all recently shown state-of-the-art word and sentence-level classification results using neural network-based models.

The second approach, and the one used in this work, is to model speechreading as an articulatory-to-acoustic mapping problem in which the “label” of each short video segment is a corresponding feature vector representing the audio signal. Kello and Plaut [8] and Hueber and Bailly [9] attempted this approach using various sensors to record mouth movements. Le Cornu and Milner [10] took this direction in a recent work where they used hand-crafted visual features to produce intelligible audio.

A major advantage of this model of learning is that it does not depend on a particular segmentation of the input data into words or sub-words. Nor does it need explicit manually-annotated labels; rather, it uses “natural supervision” [11], in which the prediction target is derived from a natural signal in the world. A regression-based model is also vocabulary-agnostic. Given a training set with a large enough representation of the phonemes/visemes of a particular language, it can reconstruct words that are not present in the training set. Classification at the sub-word level can also have the same effect. Another advantage of this model is its ability to reconstruct the non-textual parts of human speech, e.g. emotion, prosody, etc.

Researchers have spent much time and effort finding visual features which accurately map facial movements to the auditory signal. We bypass the need for feature crafting by utilizing CNNs, which have brought significant advances to computer vision in recent years. Given raw visual data as input, our network automatically learns optimal visual features for reconstructing an acoustic signal closest to the original.

In this paper, we: (1) present an end-to-end CNN-based model that predicts the speech audio signal of a silent video of a person speaking, significantly improving state-of-the-art reconstructed speech intelligibility; (2) demonstrate that allowing the model to learn from the speaker’s entire face instead of only the mouth region greatly improves performance; (3) show that modeling speechreading as a regression problem allows us to reconstruct out-of-vocabulary words.

2. SPEECH REPRESENTATION

The challenge of finding a suitable representation for an acoustic speech signal which can be estimated by a neural network on one hand, and synthesized back into intelligible audio on the other, is not trivial. Spectrogram magnitude, for example, can be used as network output; however, the quality of its resynthesis into speech is usually poor, as it does not contain phase information. Use of the raw waveform as network output was ruled out for lack of a suitable loss function with which to train the network.

Linear Predictive Coding (LPC) is a powerful and widely used technique for representing the spectral envelope of a digital speech signal, which assumes a source-filter model of speech production [12]. LPC analysis is applied to overlapping audio frames of the original speech signal, resulting in an LPC coefficient vector whose order P can be tuned. Line Spectrum Pairs (LSP) [13] are a representation of LPC coefficients which are more stable and robust to quantization and small coefficient deviations. LSPs are therefore useful for speech coding and transmission over a channel, and indeed proved to be well suited to the task at hand.

We apply the following procedure to calculate audio features suitable for use as neural network output: First, the audio from each video sequence is downsampled to 8 kHz and split into audio frames of 40 ms (320 samples) each, with an overlap of 20 ms. 8th-order LPC analysis is applied to each audio frame, as done by [10], followed by LSP decomposition, resulting in a feature vector of length 9 per frame. While 8th-order LPC is relatively low for high-fidelity modeling of the speech spectrum, we chose it in order to isolate the effect of using CNN-learned visual features versus the hand-crafted ones of [10]. Each video frame has two successive corresponding feature vectors, which are concatenated to form a sound vector, S_i ∈ R^18. See Figure 1 for an illustration of this procedure. Finally, the vectors are standardized element-wise by subtracting the mean and dividing by the standard deviation of each element.

3. PREDICTING SPEECH

3.1. Regressing sound features

Given a sequence of input frames I_1, I_2, ..., I_N we would like to estimate a corresponding sequence of sound features S_1, S_2, ..., S_N where S_i ∈ R^18.

Input representation. Our goal is to reconstruct a single audio representation vector S_i which corresponds to the duration of a single video frame I_i. However, instantaneous lip movements such as those in isolated video frames can be significantly disambiguated by using a temporal neighborhood as context. Therefore, the input to our network is a clip of K consecutive grayscale video frames, out of which the speaker’s face is cropped and scaled to 128 × 128 pixels. This results in an input volume of size 128 × 128 × K scalars, which is then normalized by dividing by the maximum pixel intensity and subtracting the mean.

Figure 2 illustrates the importance of allowing the network to learn visual features from the entire face, as opposed to the mouth region only, as widely done in the past. The two lines in the graph represent final network test error as a function of the length K of the clip used as input to the CNN. We tested the values of K ∈ {1, 3, 5, 7, 9}, while the output S_i always remained the sound features of the center frame. Not surprisingly, the largest gain in performance for both face and mouth regions is when clip length is increased from K = 1 frame to K = 3 frames, highlighting the importance of context. The advantage of learning features from the full facial information is also evident, with the best face region error 40% lower than the best mouth region error (both at K = 9). We hypothesize that this is a result of our CNN using the increased amount of visual information to disambiguate similar mouth movements.

Fig. 2. This figure illustrates (1) the importance of allowing the network to learn visual features from the speaker’s entire face, as opposed to the mouth region only; (2) the disambiguation effect of using temporal context. Green line (top) is test error as a function of input clip length K when using only mouth region, and blue line (bottom) is test error when using full face region. Face region error is 40% lower than mouth region in the best configuration.

Sound prediction model. We use a convolutional neural network (CNN) that takes the aforementioned video clip of size 128 × 128 × K as input. Our network uses VGG-like [14] stacks of small 3×3 receptive fields in its convolutional layers. The architecture comprises five consecutive conv3-conv3-maxpool blocks consisting of 32-32-64-128-128 kernels, respectively. These are followed by two fully connected layers with 512 neurons each. The last layer of our CNN is of size 18, which corresponds to the size of the sound representation vectors we wish to predict. The network is trained with backpropagation using mean squared error (MSE) loss.

3.2. Generating a waveform

Source-filter speech synthesizers such as [15] use both filter parameters and an excitation signal to construct an acoustic signal from LPC features. Predicting excitation parameters is out of the scope of this work, and we therefore use Gaussian white noise as the excitation signal. This produces an unvoiced speech signal and results in unnatural sounding speech. Although this method of generating a waveform is relatively simplistic, we found that it worked quite well for speech intelligibility purposes, which is the focus of our work.

4. EXPERIMENTS

We applied our speech-reconstruction model to several tasks, and evaluated it with a human listening study. Examples of reconstructed speech can be found at http://www.vision.huji.ac.il/vid2speech

Implementation details. Our network implementation is based on the Keras library [16] built on top of TensorFlow [17]. Network weights are initialized using the initialization procedure suggested by He et al. [18]. We use Leaky ReLU [19] as the non-linear activation function in all layers but the last two, in which we use the hyperbolic tangent (tanh) function. Adam optimizer [20] is used with a learning rate of 0.003. Dropout [21] is used to prevent overfitting, with a rate of 0.25 after convolutional layers and 0.5 after fully connected ones. We use mini-batches of 32 training samples each and stop training when the validation loss stops decreasing (around 80 epochs). Training is done using a single Nvidia Titan Black GPU. We use a cascade-based face detector from OpenCV [22], and crop out the mouth region for the comparison in Figure 2 by using a hard-coded mask. For LPC analysis/resynthesis, as well as excitation generation, we used pysptk, a Python wrapper for Speech Signal Processing Toolkit (SPTK) [23].

4.1. GRID corpus

We performed our experiments on the GRID audiovisual sentence corpus [24], a large dataset of audio and video (facial) recordings of 1000 sentences spoken by 34 talkers (18 male, 16 female). Each sentence consists of a six word sequence of the form shown in Table 1, e.g. “Place green at H7 now”.

Command   Color   Preposition   Letter        Digit   Adverb
bin       blue    at            A-Z minus W   0-9     again
lay       green   by                                  now
place     red     in                                  please
set       white   with                                soon

Table 1. GRID sentence grammar.

A total of 51 different words are contained in the GRID corpus. Videos have a fixed duration of 3 seconds at a frame rate of 25 FPS with 720×576 resolution, resulting in sequences comprising 75 frames. These videos are preprocessed as described in Section 3.1 before feeding them into the network. The acoustic part of the GRID corpus is used as described in Section 2.

In order to accurately compare our results with [10], we performed our experiments on the 1000 videos of speaker four (S4, female) as done there. The training/testing split for each experiment will be described in the following sections.

4.2. Sound prediction tasks

Reconstruction from full dataset. The first task, proposed by [10], is designed to examine whether reconstructing audio from visual features can produce intelligible speech. For this task we trained our model on a random 80/20 train/test split of the 1000 videos of S4 and made sure that all 51 GRID words were represented in each set. The resulting representation vectors were converted back into waveform using unvoiced excitation, and two different multimedia configurations were constructed: the predicted audio only, and the combination of the original video with reconstructed audio.

Reconstructing out-of-vocabulary words. As noted earlier, regression-based models can be used to reconstruct out-of-vocabulary (OOV) words. To test this, we performed the following experiment: The videos in our dataset were sorted according to the digit uttered in each sentence, and our network was trained and tested on five different train/test splits, each with two distinct digits left out of the training set. For example, the network was trained on all sequences with the numbers 1-8 uttered, and tested only on sequences containing the numbers 9 and 0.

4.3. Evaluating the speech predictions

We assessed the intelligibility of the reconstructed speech using a human listening study done using Amazon Mechanical Turk (MTurk). Each job consisted of transcribing one of three types of 3-second clips: audio-only, audio-visual and OOV audio-visual. The listeners were unaware of the differences between the clips. For each clip, they were given the GRID vocabulary and tasked with classifying each reconstructed word into one of its possible options. All together, over 400 videos containing 38 distinct sequences were transcribed by 23 different MTurk workers, which is comparable to the 20-listener study done by [10].

4.4. Results

Table 2 shows the results of our first task, reconstruction from the full dataset, along with a comparison to [10]. Our reconstructed audio is significantly more intelligible than the best results of [10], as shown by both audio-only and audio-visual tests. The final column shows the result of retraining and testing our model on another speaker from the GRID corpus, speaker two (S2, male), whose speech clarity is comparable to S4, as reported by [24]. We used the same listening test methodology described above, however this time only using combined audio and video. Examples of original vs. reconstructed LSP coefficients, waveform and spectrogram for this task can be seen in Figure 3.

               [10]     Ours (S4)   Ours (S2)
Audio-only     40.0%    82.6%       -
Audio-visual   51.9%    79.9%       79%

Table 2. Our reconstructed speech is significantly more intelligible than the results of [10]. We tested our model on videos from two different speakers in the GRID corpus, S2 (male) and S4 (female). Randomly guessing a word from each GRID category would result in 19% “intelligibility”.

Fig. 3. Examples of original (top) and reconstructed (bottom): (a) LSP coefficients, (b) waveform and (c) spectrogram. The vertical columns of (a) are the actual output of the CNN. Spectral envelope of reconstructed audio (c) is relatively accurate, however unvoiced excitation results in the lack of formants (horizontal lines inside spectral envelope, representing frequency of voiced speech).

Results for the OOV task which appear in Table 3 were obtained by averaging digit annotation accuracies of the five train/test splits. The fact that human subjects were over five times more likely than chance to choose the correct digit uttered after listening to the reconstructed audio shows that using regression to solve the OOV problem is a promising direction. Moreover, using a larger and more diversified training set vocabulary is likely to significantly increase OOV reconstruction intelligibility.

               OOV      None out   Chance
Audio-visual   51.6%    93.4%      10.0%

Table 3. Out-of-vocabulary (OOV) intelligibility results. We tested this by reconstructing spoken digits which were left out of the training set. Listeners were five times more likely to choose the correct digit than randomly guessing, however only slightly more than half as likely compared to having all digits represented in the training set.

5. CONCLUDING REMARKS

This work has demonstrated the feasibility of reconstructing an intelligible audio speech signal from silent video frames. OOV word reconstruction was also shown to hold promise by modeling automatic speechreading as a regression problem, and using a CNN to automatically learn relevant visual features.

The work described in this paper can serve as a basis for several directions of further research. These include using a less constrained video dataset to show real-world reconstruction viability and generalizing to speaker-independent and multiple speaker reconstruction.

This research was supported by Israel Science Foundation, by DFG and by Intel ICRI-CI.

6. REFERENCES

[1] Eric David Petajan, Automatic lipreading to enhance speech recognition (speech reading), Ph.D. thesis, University of Illinois at Urbana-Champaign, 1984.
[2] Iain Matthews, Timothy F Cootes, J Andrew Bangham, Stephen Cox, and Richard Harvey, “Extraction of visual features for lipreading,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 198–213, 2002.
[3] Ziheng Zhou, Guoying Zhao, Xiaopeng Hong, and Matti Pietikäinen, “A review of recent advances in visual speech decoding,” Image and Vision Computing, vol. 32, no. 9, pp. 590–605, 2014.
[4] Helen L Bear and Richard Harvey, “Decoding visemes: Improving machine lip-reading,” in ICASSP’16, 2016, pp. 2009–2013.
[5] Michael Wand, Jan Koutník, et al., “Lipreading with long short-term memory,” in ICASSP’16, 2016, pp. 6115–6119.
[6] Yannis M Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas, “Lipnet: End-to-end sentence-level lipreading,” arXiv preprint arXiv:1611.01599, 2016.
[7] Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman, “Lip reading sentences in the wild,” arXiv preprint arXiv:1611.05358, 2016.
[8] Christopher T Kello and David C Plaut, “A neural network model of the articulatory-acoustic forward mapping trained on recordings of articulatory parameters,” The Journal of the Acoustical Society of America, vol. 116, no. 4, pp. 2354–2364, 2004.
[9] Thomas Hueber and Gérard Bailly, “Statistical conversion of silent articulation into audible speech using full-covariance hmm,” Comput. Speech Lang., vol. 36, no. C, pp. 274–293, Mar. 2016.
[10] Thomas Le Cornu and Ben Milner, “Reconstructing intelligible audio speech from visual speech features,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[11] Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H Adelson, and William T Freeman, “Visually indicated sounds,” in CVPR’16, 2016.
[12] Gunnar Fant, Acoustic theory of speech production: with calculations based on X-ray studies of Russian articulations, vol. 2, Walter de Gruyter, 1971.
[13] Fumitada Itakura, “Line spectrum representation of linear predictor coefficients of speech signals,” The Journal of the Acoustical Society of America, vol. 57, no. S1, pp. S35–S35, 1975.
[14] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
[15] Dennis H Klatt and Laura C Klatt, “Analysis, synthesis, and perception of voice quality variations among female and male talkers,” The Journal of the Acoustical Society of America, vol. 87, no. 2, pp. 820–857, 1990.
[16] François Chollet, “Keras,” https://github.com/fchollet/keras, 2015.
[17] “Tensorflow,” Software available from http://tensorflow.org/.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in ICCV’15, 2015.
[19] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng, “Rectifier nonlinearities improve neural network acoustic models,” in ICML’13, 2013.
[20] Diederik Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
[21] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[22] G. Bradski, “Opencv,” Dr. Dobb’s Journal of Software Tools, 2000.
[23] “Speech signal processing toolkit,” Available from http://sp-tk.sourceforge.net/readme.php.
[24] Martin Cooke, Jon Barker, Stuart Cunningham, and Xu Shao, “An audio-visual corpus for speech perception and automatic speech recognition,” The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2421–2424, 2006.
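
The audio feature procedure of Section 2 (8 kHz audio, 40 ms frames with a 20 ms overlap, one length-9 vector per audio frame, two vectors concatenated per video frame, element-wise standardization) can be sketched as follows. This is a minimal illustration of the framing arithmetic only, under stated assumptions: `placeholder_features` is a hypothetical stand-in for the LPC analysis and LSP decomposition that the paper performs with pysptk, and the zero-padding of the final frame is our assumption, since the text does not say how 150 audio frames are obtained from 3 seconds of audio.

```python
import math

SAMPLE_RATE = 8000      # Hz, after downsampling (from the text)
FRAME_LEN = 320         # 40 ms at 8 kHz (from the text)
HOP = 160               # 20 ms step, so frames overlap by half (from the text)
FEATS_PER_FRAME = 9     # one length-9 vector per audio frame (from the text)
VIDEO_FRAMES = 75       # 3 s of video at 25 FPS (from the text)


def split_frames(samples, n_frames, frame_len=FRAME_LEN, hop=HOP):
    """Cut samples into n_frames half-overlapping frames.

    ASSUMPTION: the tail is zero-padded so that exactly n_frames fit.
    """
    needed = (n_frames - 1) * hop + frame_len
    padded = list(samples) + [0.0] * max(0, needed - len(samples))
    return [padded[i * hop:i * hop + frame_len] for i in range(n_frames)]


def placeholder_features(frame, dims=FEATS_PER_FRAME):
    """HYPOTHETICAL stand-in for LPC analysis + LSP decomposition.

    Returns the mean absolute amplitude of `dims` consecutive chunks,
    only so that the downstream shapes can be demonstrated.
    """
    n = len(frame)
    bounds = [i * n // dims for i in range(dims + 1)]
    return [sum(abs(x) for x in frame[a:b]) / (b - a)
            for a, b in zip(bounds, bounds[1:])]


def pair_features(per_frame):
    """Concatenate every two successive feature vectors (9 + 9 = 18)."""
    return [per_frame[i] + per_frame[i + 1]
            for i in range(0, len(per_frame) - 1, 2)]


def standardize(vectors):
    """Subtract the mean and divide by the standard deviation, per element."""
    n, dims = len(vectors), len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dims)]
    stds = [math.sqrt(sum((v[d] - means[d]) ** 2 for v in vectors) / n) or 1.0
            for d in range(dims)]
    return [[(v[d] - means[d]) / stds[d] for d in range(dims)] for v in vectors]


# Synthetic 3-second signal standing in for one GRID utterance.
audio = [math.sin(0.05 * n) for n in range(3 * SAMPLE_RATE)]
frames = split_frames(audio, n_frames=2 * VIDEO_FRAMES)
sound_vectors = standardize(
    pair_features([placeholder_features(f) for f in frames]))
print(len(frames), len(sound_vectors), len(sound_vectors[0]))
```

With these settings the sketch yields 150 audio frames and 75 sound vectors of length 18, one per video frame. In practice the statistics used for standardization would presumably be computed over the training set rather than over a single utterance.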

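
The layer sizes given for the sound prediction model in Section 3.1 (five conv3-conv3-maxpool blocks with 32-32-64-128-128 kernels, two fully connected layers of 512 neurons, and an output of size 18) can be traced through a 128 × 128 × K input with a few lines of arithmetic. The sketch below is not the authors' Keras code; it assumes zero-padded ("same") 3×3 convolutions, 2×2 max pooling with stride 2, the K frames stacked as input channels, and a bias term per kernel, none of which the text states explicitly.

```python
def trace_shapes(k_frames=9, size=128, kernels=(32, 32, 64, 128, 128),
                 fc=(512, 512), out_dim=18):
    """Return ([(layer_name, output_shape), ...], total_parameter_count).

    ASSUMPTIONS: 'same' padding, 2x2 pooling with stride 2, biases everywhere.
    """
    shapes, params = [], 0
    h = w = size
    c = k_frames                              # K grayscale frames as channels
    for block, n in enumerate(kernels, start=1):
        for conv in (1, 2):                   # conv3 - conv3
            params += 3 * 3 * c * n + n       # 3x3 weights plus biases
            c = n
            shapes.append((f"block{block}_conv{conv}", (h, w, c)))
        h, w = h // 2, w // 2                 # maxpool halves each spatial dim
        shapes.append((f"block{block}_pool", (h, w, c)))
    prev = h * w * c
    shapes.append(("flatten", (prev,)))
    for i, n in enumerate(fc, start=1):       # two fully connected layers
        params += prev * n + n
        prev = n
        shapes.append((f"fc{i}", (n,)))
    params += prev * out_dim + out_dim        # linear map to the 18-d vector
    shapes.append(("output", (out_dim,)))
    return shapes, params


layer_shapes, n_params = trace_shapes(k_frames=9)
for name, shape in layer_shapes:
    print(f"{name:14s} {shape}")
print("parameters:", n_params)
```

Under these assumptions the spatial size shrinks 128, 64, 32, 16, 8, 4, so the flattened vector entering the first fully connected layer has 4 × 4 × 128 = 2048 elements, and that layer accounts for more than half of the roughly 1.9 million parameters.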

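
The chance levels quoted with Tables 2 and 3 (19% for a full GRID sentence, 10% for a digit) follow from the sentence grammar in Table 1. The short calculation below reproduces them, assuming one uniform random guess per word slot and equal weight for each of the six slots; this is our reading of how the 19% figure arises, not a derivation given in the paper.

```python
from fractions import Fraction

# Number of options per GRID word slot, read off Table 1
# (letters are A-Z minus W, hence 25).
GRID_CATEGORY_SIZES = {
    "command": 4,
    "color": 4,
    "preposition": 4,
    "letter": 25,
    "digit": 10,
    "adverb": 4,
}


def chance_word_accuracy(sizes):
    """Expected fraction of correct words with one uniform guess per slot."""
    return sum(Fraction(1, n) for n in sizes.values()) / len(sizes)


vocabulary_size = sum(GRID_CATEGORY_SIZES.values())
sentence_chance = chance_word_accuracy(GRID_CATEGORY_SIZES)
digit_chance = Fraction(1, GRID_CATEGORY_SIZES["digit"])

print("vocabulary size:", vocabulary_size)
print("sentence-level chance:", float(sentence_chance))
print("digit-only chance:", float(digit_chance))
print("OOV accuracy relative to chance:", 51.6 / (100 * float(digit_chance)))
```

The slot sizes sum to the 51 words reported for the corpus, and the OOV accuracy of 51.6% is a little over five times the 10% digit chance level, consistent with the caption of Table 3.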