Unsupervised Alignment of Natural Language with Video

by Iftekhar Naim

Submitted in Partial Fulfillment of the Requirements for the Degree
Doctor of Philosophy

Supervised by Professor Daniel Gildea

Department of Computer Science
Arts, Sciences and Engineering
Edmund A. Hajim School of Engineering and Applied Sciences

University of Rochester
Rochester, New York
2015

To my grandparents

Biographical Sketch

Iftekhar Naim completed his Bachelor of Science degree in Computer Science in 2007 from Bangladesh University of Engineering and Technology (BUET). From 2007 to 2008, he worked as a Software Engineer at Spectrum Engineering Consortium, Dhaka, Bangladesh. In 2008, he moved to Rochester, NY, and completed a Master of Science degree in Electrical and Computer Engineering from the University of Rochester. He joined the PhD program in the Computer Science Department at the University of Rochester in 2011, under the supervision of Prof. Daniel Gildea. He did several research internships at Bosch Research and Google Inc. He is a student member of IEEE, AAAI, and ACL.

Acknowledgments

First and foremost, I would like to thank my advisor Daniel Gildea, who has always been kind and supportive. I have learned so many things from Dan, both regarding academic research and personal values. Despite being busy with his own research and teaching, Dan always provided me with plenty of time, whether it was about writing papers, figuring out mathematical details of my models, or even providing insights regarding possible bugs in my code. Most importantly, Dan always encouraged me to pursue my ideas and guided me to remain on the right track. I hope to follow his example and treat my future colleagues and collaborators with the same level of sincerity and commitment.

I would like to thank Henry Kautz for introducing me to this exciting domain of 'language and vision' and for including me in the wetlab project. It was a great pleasure working with Henry and making progress with the project, which became the primary focus of my dissertation. I am also grateful to Henry for believing in me and for introducing me to many great researchers in my area.

I feel fortunate to have Ehsan Hoque joining our department at the beginning of my third year. Ehsan has been an incredible mentor. From Ehsan, I learned to step back from the technical details from time to time, and to look at the bigger picture. Ehsan has a great quality of inspiring people around him, and I always felt more energized and motivated after our meetings. I am also thankful to Ehsan for providing me the opportunities to give guest lectures in his classes.

I would like to thank my external committee members Steve Piantadosi and Robert Jacobs for always being accommodating with their time and for providing their valuable input regarding my research. I am also grateful to Jiebo Luo, Liang Huang, and Jeffrey Bigham for their comments and suggestions in different phases of my PhD studies.

I have been fortunate to collaborate with several graduate students in our department. I would like to thank Young Chol Song and Qiguang Liu for helping me with video processing and contributing to the papers that we wrote together. I am grateful to Iftekhar Tanveer, not only for his help with the job interview project, but also for many fun conversations that we regularly had in the coffee shops around campus. I would like to thank Walter Lasecki for all his help with the crowd captioning project. My special thanks go to Abdullah Al Mamun, an amazing undergraduate student and my research collaborator, for his hard work and commitment to AI research.
I am also grateful to the amazing staff members, especially Marty, Pat, JoMarie, Eileen, and Niki, my fellow graduate students, and the faculty members at URCS, who always treated me kindly.

My life at Rochester became so much more eventful and memorable because of my amazing friends, who made Rochester my second home. I am incredibly grateful to my Rochester family, especially Naushad, Talat, Ashker, Farzana apu, Zobayer bhai, Akhi bhabi, Naseef, Pappu, Nishi, Juni, Tonima, Tasnif, Tonmoy, Anis, Towhid, and Tousif. I would also like to thank my friends and collaborators at the University of Rochester: Amal, Phyo, Rahman, Roya, Pencheng, Xiaochang, Licheng, Mariyam, Tag, Lingfeng, Nasrin, Omid, Adam, and my past lab-mates Chao, Orhan, Carlos, Arif, Basak, and Andy. I am also thankful to my best friends, especially Enamul, Tanima, Shafi, Sagar, Laboni, Shahan, and Atif for all the fun times we spent together.

Finally, I would like to thank my loving family for their never-ending support. I am extremely thankful to my mother, Dr. Nayeema Kabir, for always being there for me. Even though we have been living on different continents, she has always been informed about every little detail of my grad life and has always inspired and supported me. I would like to thank my wonderful sister, Anisia, and my brilliant nephew, Ayman, who have been living with me for the last year and sharing many moments of happiness. I am grateful to Maa, Baba, Nanu, Rinu khala, and Benoo khala for their unconditional love, and to Nitu, Tasbir, and Eshaan for being a continuous source of joy.

I am fortunate to have a wonderful wife, Shantonu, who has been extremely loving, caring, and supportive in every possible way. The happiest period of my grad life was the first two years, when she lived in Rochester with me. Even though she had to leave Rochester for her job, we continued sharing every moment of happiness and sadness, celebrated every success together, and supported each other in times of failure. I am so happy that we will be together again after my graduation.

I would like to acknowledge my grandmother and my late grandfather, to whom I dedicate my thesis. They have been the greatest gifts in my life, and I regret not spending more time with them. Even though my grandfather passed away, I feel his presence in the pages of his books and in my random thoughts. I hope to remember the values that they taught me and lead an honest and simple life like them.

Abstract

Today we encounter large amounts of video data, often accompanied by text descriptions (e.g., cooking videos and recipes, videos of wetlab experiments and protocols, movies and scripts). Extracting meaningful information from these multimodal sequences requires aligning the video frames with the corresponding sentences in the text. Previous methods for connecting language and videos relied on manual annotations, which are often tedious and expensive to collect. In this thesis, we focus on automatically aligning sentences with the corresponding video frames without any direct human supervision.

We first propose two hierarchical generative alignment models, which jointly align each sentence with the corresponding video frames, and each noun in a sentence with the corresponding object in the video frames. Next, we propose several latent-variable discriminative alignment models, which incorporate rich features involving verbs and video actions, and outperform the generative models. Our alignment algorithms are primarily applied to align biological wetlab videos with text instructions.
Furthermore, we extend our alignment models to automatically align movie scenes with associated scripts, and to learn word-level translations between language pairs for which bilingual training data is unavailable.

Thesis: By exploiting the temporal ordering constraints between video and associated text, it is possible to automatically align the sentences in the text with the corresponding video frames without any direct human supervision.

Contributors and Funding Sources

This work was supervised by a dissertation committee consisting of Professors Daniel Gildea, Henry Kautz, and M. Ehsan Hoque from the Department of Computer Science and Professors Steve Piantadosi and Robert Jacobs from the Department of Brain and Cognitive Sciences of the University of Rochester. In Chapter 3 and Chapter 4, Young Chol Song and Qiguang Liu helped with video processing, segmentation, and tracking. Professor Jiebo Luo provided many helpful suggestions for all the tasks related to computer vision. Professor Liang Huang helped with his valuable comments and suggestions for the discriminative alignment task (Chapter 4). Abdullah Al Mamun helped with annotating ground-truth labels for movie tracks, and provided great support on the movie-to-script alignment project (Chapter 5). Md. Iftekhar Tanveer and Leon Weingard helped with analyzing the job interview dataset. I also collaborated with Walter Lasecki, Mohammad Kazemi, and Jeffrey Bigham on the real-time crowd-captioning project. All other work conducted for the dissertation was completed by the student independently. Work presented here was supported by NSF grants IIS-1446996 and IIS-1449278, Intel ISTC-PC, and ONR grant N00014-11-10417.

Table of Contents

Biographical Sketch
Acknowledgments
Abstract
Contributors and Funding Sources
List of Tables
List of Figures

1 Introduction
1.1 Motivation and Overview
1.2 Summary of Primary Contributions
1.3 Other Relevant Contributions
1.4 Thesis Outline

2 Grounded Language Learning: Background
2.1 Grounded Language Learning for Connecting Language with Vision
2.2 Existing Research on Grounded Language Learning
2.3 Fully Supervised Approaches
2.4 Semi-supervised Approaches
2.5 Our Contribution: Unsupervised Grounded Language Learning

3 Generative Alignment Models
3.1 Overview
3.2 Aligning Wetlab Protocols with Videos
3.3 Problem Formulation and Notations
3.4 Generative Models for Joint Alignment
3.5 Experimental Results
3.6 Related Works and Discussions
3.7 Future Directions

4 Discriminative Alignment Models
4.1 Motivation and Overview
4.2 Related Work
4.3 Discriminative Alignment
4.4 Feature Design
4.5 Experimental Results
4.6 Discussions and Future Work

5 Aligning Movies with Scripts
5.1 Overview and Motivation
5.2 Related Work
5.3 Data Processing Pipeline
5.4 Experimental Results
5.5 Conclusion and Future Work