
Audio-Visual Speech Processing for Multimedia Localisation by Matthew Aaron Benatan (PDF)

161 Pages·2017·4.88 MB·English

Preview: Audio-Visual Speech Processing for Multimedia Localisation by Matthew Aaron Benatan

Audio-Visual Speech Processing for Multimedia Localisation

by Matthew Aaron Benatan

Submitted in accordance with the requirements for the degree of Doctor of Philosophy

The University of Leeds
School of Computing
September 2016

Declarations

The candidate confirms that the work submitted is his/her own, except where work which has formed part of a jointly authored publication has been included. The contribution of the candidate and the other authors to this work has been explicitly indicated below. The candidate confirms that appropriate credit has been given within the thesis where reference has been made to the work of others.

Chapter three extends work from the following publications:

Matt Benatan and Kia Ng. Cross-Covariance-Based Features for Speech Classification in Film Audio. Journal of Visual Languages and Computing, volume 31, Part B, 215-221. 2015.

Matt Benatan and Kia Ng. Cross-Covariance-Based Features for Speech Classification in Film Audio. Proceedings of the 21st International Conference on Distributed Multimedia Systems, 72-77. 2015.

The following publications contain early versions of concepts discussed in chapter five:

Matt Benatan and Kia Ng. Feature Matching of Simultaneous Signals for Multimodal Synchronization. Proceedings of the 2nd International Conference on Information Technologies for Performing Arts, Media Access, and Entertainment, volume 7990 of Lecture Notes in Computer Science, 266-275. 2013.

Matt Benatan and Kia Ng. Multimodal Feature Matching for Event Synchronization. Proceedings of the 19th International Conference on Distributed Multimedia Systems, 9-13. 2013.

The candidate confirms that the above jointly authored publications are primarily the work of the first author. The role of the second author was editorial and supervisory.

This copy has been supplied on the understanding that it is copyright material and that no quotation from the thesis may be published without proper acknowledgement.

© 2016 The University of Leeds and Matthew Aaron Benatan

Abstract

For many years, film and television have dominated the entertainment industry. Recently, with the introduction of a range of digital formats and mobile devices, multimedia's ubiquity as the dominant form of entertainment has increased dramatically. This, in turn, has increased demand on the entertainment industry, with production companies looking to increase their revenue by providing entertainment media to a growing international market. This brings with it challenges in the form of multimedia localisation - the process of preparing content for international distribution. The industry is now looking to modernise production processes - moving what were once wholly manual practices to semi-automated workflows.

A key aspect of the localisation process is the alignment of content, such as subtitles or audio, when adapting content from one region to another. One method of automating this is to use the audio content as a guide, providing a solution via audio-to-text alignment. While many approaches for audio-to-text alignment currently exist, these all require language models - meaning that dozens of language models would be required for these approaches to be reliably implemented in large production companies. To address this, this thesis explores the development of audio-to-text alignment procedures which do not rely on language models, instead providing a language independent method for aligning multimedia content. To achieve this, the project explores both audio and visual speech processing, with a focus on voice activity detection, as a means for segmenting and aligning audio and text data.
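(As a rough illustration of the audio features this builds on - the MFCC cross-covariance features of chapter three and the publications listed above - here is a minimal Python sketch that computes pairwise covariances between MFCC coefficient trajectories for one analysis window. It assumes librosa and NumPy are available; it is not the thesis implementation, and the coefficient count is a placeholder.)

import numpy as np
import librosa

def mfcc_cc_features(window, sr, n_mfcc=13):
    """Pairwise covariance features between MFCC trajectories for one window."""
    # MFCC trajectories for the window: shape (n_mfcc, n_frames)
    mfcc = librosa.feature.mfcc(y=window, sr=sr, n_mfcc=n_mfcc)
    # Covariance between every pair of coefficient trajectories
    cov = np.cov(mfcc)
    # Keep each unordered pair once (strict upper triangle) as the feature vector
    upper = np.triu_indices(n_mfcc, k=1)
    return cov[upper]

A classifier such as an SVM or random forest (both evaluated in chapter three) would then label each window as speech or non-speech.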
The thesis first presents a novel method for detecting speech activity in entertainment media. This method is compared with the current state of the art, and demonstrates significant improvement over baseline methods. Secondly, the thesis explores a novel set of features for detecting voice activity in visual speech data. Here, we show that the combination of landmark and appearance-based features outperforms recent methods for visual voice activity detection, and specifically that the incorporation of landmark features is particularly crucial when presented with challenging natural speech data. Lastly, a speech activity-based alignment framework is presented which demonstrates encouraging results. Here, we show that Dynamic Time Warping (DTW) can be used for segment matching and alignment of audio and subtitle data, and we also present a novel method for aligning scene-level content which outperforms DTW for sequence alignment of finer-level data. To conclude, we demonstrate that combining global and local alignment approaches achieves strong alignment estimates, but that the resulting output is not sufficient for wholly automated subtitle alignment. We therefore propose that this be used as a platform for the development of lexical-discovery based alignment techniques, as the general alignment provided by our system would improve symbolic sequence discovery for sparse dictionary-based systems.
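(To make the DTW-based alignment above concrete, here is a minimal sketch, assuming NumPy and treating the two inputs as one-dimensional speech-activity envelopes - illustrative only, not the thesis code. It finds the cheapest monotonic path through the pairwise cost matrix, as depicted in figure 2.9.)

import numpy as np

def dtw_path(query, reference):
    """Align query to reference; return the total cost and the warping path."""
    # Pairwise absolute-difference cost between the two envelopes
    cost = np.abs(np.subtract.outer(query, reference))
    n, m = cost.shape
    # Accumulated cost, with an infinite border enforcing the boundary condition
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack from the end to recover the monotonic warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return acc[n, m], path[::-1]

The returned path maps query indices to reference indices, from which segment boundaries (such as subtitle timings) could be rescaled.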
Acknowledgements

Firstly I would like to thank Dr. Kia Ng - my supervisor and my friend. Working with you over the past seven years has been enormously inspiring. Thank you for your mentorship, encouragement, and for teaching me the value of pursuing crazy ideas (and for the wisdom to know when the ideas are perhaps a little too crazy).

Thank you also to Andy Bulpitt for taking on supervision during the concluding months of this PhD. You have done a superb job of picking up the project at the last minute, and I'm very grateful for the guidance you've provided through this crucial time. I would also like to thank my co-supervisors Derek Magee and Katja Markert. Our conversations were always very valuable, and helped me to expand my understanding of, and interest in, the field I have chosen to pursue.

To my friends at The University of Leeds - Sam, Alicja, Luke, Bernhard, Leroy, Christian, Dan, Olly, Jen, Aryana and all in the School of Computing. You each contributed to making my time at Leeds thoroughly enjoyable. Thank you for the fascinating discussions, the evenings spent at the pub, and for helping to keep me sane throughout this PhD.

I would also like to thank the EPSRC and ZOO Digital - the collaboration between academia and industry has been hugely valuable, and I am very grateful to both for providing the financial support required for this work. Thank you also to Stuart Green for your commitment, support and guidance throughout this project. Thank you also to Duke University and the University of Leeds for granting permission to use their audiovisual content within this work.

To my close friends Rob and Chris - from all those years playing in bands, to making wacky videos, to simply geeking out over lengthy sessions of tech-talk or boardgames. You have each been instrumental in the development of my interests - both creatively and technologically - and I'm hugely thankful for your support, guidance and continuing friendship. Thank you also to my long standing friend, Dr. Thomas Hazlehurst, for the lunch dates, the LaTeX tuition and for your incredible beard.

While rewarding, the past few years have also been incredibly demanding, both intellectually and emotionally. This experience would have been far more difficult without the tremendous dedication and support of my best friend and partner, Rebecca, who has kept me on a steady path throughout the ups and downs of PhD life. I am deeply grateful to you for your continuing love, support and patience.

Lastly, thank you to my parents, Dan and Debby. Your tremendous support and encouragement throughout the years has motivated me to continue challenging myself to embark on new and interesting pursuits. Without your support, I certainly wouldn't have engaged in such a rewarding and valuable journey.

Contents

1 Introduction 1
  1.1 Overview of Film Post Production Workflows 1
    1.1.1 Automatic Dialogue Replacement 2
    1.1.2 Format Conversion 2
    1.1.3 Subtitle Localisation and Dialogue Adaptation 3
  1.2 Motivation and Contributions 4
  1.3 Thesis Overview 7

2 Background 9
  2.1 Audio Speech Processing 9
    2.1.1 Audio Speech Feature Extraction 9
    2.1.2 Audio Voice Activity Detection 12
  2.2 Computer Vision Approaches for Speech Processing 18
    2.2.1 Face Detection 18
    2.2.2 Landmark Localisation 23
    2.2.3 Visual Speech Processing 29
    2.2.4 Computer Vision Approaches for Speech Processing - Summary 35
  2.3 Feature Matching and Sequence Alignment 35
    2.3.1 Speech to Text Alignment 35
    2.3.2 Automatic Speech Alignment 38
    2.3.3 Summary 41
  2.4 Conclusion 42

3 Detecting Speech in Entertainment Audio 43
  3.1 Introduction 43
  3.2 Datasets 44
  3.3 Machine Learning Techniques for Voice Activity Detection 46
    3.3.1 Sonnleitner et al. 46
    3.3.2 MFCC Cross-Covariance Features 48
    3.3.3 Evaluation Design 53
    3.3.4 Evaluation Results 54
  3.4 Experimental Design 60
  3.5 Initial Investigation 60
    3.5.1 Sonnleitner VAD 60
    3.5.2 MFCC-CC VAD 61
    3.5.3 Conclusion 62
  3.6 Comparison with Contemporary and State of the Art Approaches 63
  3.7 Six Film Cross-Validation Investigation 65
  3.8 Non-English Speech Tests 66
  3.9 Conclusion 68

4 Visual Speech Processing 69
  4.1 Introduction 69
  4.2 Datasets 70
  4.3 Feature Extraction and Selection 71
    4.3.1 Landmark Features 72
    4.3.2 Two Dimensional Discrete Cosine Transforms 73
    4.3.3 Feature Selection Via Audio-Visual Speech Correlation 74
  4.4 Visual Voice Activity Detection 75
    4.4.1 Feature Extraction 76
    4.4.2 Experimental Design 78
    4.4.3 Speaker Dependent Results 79
    4.4.4 Speaker Independent Results 83
      Gender Balanced Dataset 89
    4.4.5 Natural Speech Dataset Results 91
  4.5 Conclusion 92

5 Language Independent Feature Matching and Alignment 94
  5.1 Introduction 94
  5.2 Data Representation 95
  5.3 Anchor Point Detection and Signal Segmentation 98
    5.3.1 Anchor Point Clustering 100
    5.3.2 Audio to Text Association 102
  5.4 Audio to Text Association of Whole Film Content 103
    5.4.1 Start Point Alignment 104
    5.4.2 Anchor Point Evaluation 104
    5.4.3 Segment Matching 105
    5.4.4 Scale Coefficient Estimation 106
    5.4.5 Transcript Alignment and Matching 109
  5.5 Scene-Level Alignment 111
    5.5.1 Segment-Based Alignment 112
    5.5.2 Anchor Point Evaluation 114
    5.5.3 Segment Matching 115
    5.5.4 Scale Coefficient Estimation 117
    5.5.5 Incorporating Visual Features 118
  5.6 Improving General Alignment Through Incremental Scene-Level Alignment 121
  5.7 Conclusion 124

6 Conclusions and Future Work 126
  6.1 Application Contexts 127
    6.1.1 Automatic Content Segmentation 127
    6.1.2 Subtitle Validation 128
    6.1.3 Enhancement of Automatic Transcription Methods 128
    6.1.4 Improving ADR Through AV-VAD 128
    6.1.5 Pre-Processing for Language Independent Alignment 128
  6.2 Future Work 129

List of Figures

1.1 Diagram of speech activity-based text-to-audio alignment. 5
2.1 Spectrograms of speech (a) and music (b) content illustrating the difference in harmonic patterns. 14
2.2 Illustration of Haar-like rectangular features. 19
2.3 Illustration of LBP feature computation. 20
2.4 Illustration of LBP invariance to illumination conditions. 21
2.5 Example of HOG features. Left: input image. Right: HOG features. 22
2.6 Flow diagram of CLM search algorithm. 27
2.7 Example of approach from [113]'s performance on partially occluded data. 27
2.8 Example of approach from [66]'s performance on partially occluded data. 29
2.9 Diagram of the path through the cost matrix mapping the query signal to the reference signal, produced by DTW. 39
2.10 Illustration of smoothing process from [116]. 41
3.1 Diagram of Mel-scale filterbank. 51
3.2 Matrix of MFCC pair correlation coefficient differences between speech and non-speech data. Darker squares indicate greater values. 52
3.3 Random forest classification results using a range of estimators. 55
3.4 Random forest classification results using a range of MFCC-CC features. 55
3.5 Accuracy scores for linear kernel SVM over a range of C values. 56
3.6 F-scores for linear kernel SVM over a range of C values. 56
3.7 Heatmap of accuracy scores from SVM grid search with polynomial kernel SVM. 57
3.8 Heatmap of accuracy scores from SVM grid search with RBF kernel SVM using 5 MFCC-CC features. 58
3.9 RBF kernel SVM performance over a range of MFCC-CC features using parameters C = 1.0 and γ = 0.001. 58

Description:
SFD - Spectral Flatness Detection
SNR - Signal to Noise Ratio
SOLA - Synchronised Overlap Add
STFT - Short Time Fourier Transform
SVM - Support Vector Machine
TN - True Negative
TP - True Positive
VAD - Voice Activity Detection
V-VAD - Visual Voice Activity Detection
VoIP - Voice Over IP

