Generating Natural Language Summaries for Image Sets

by Akash Abdu Jyothi

Dual Degree (B.Tech. and M.Tech.), Indian Institute of Technology Madras, 2013

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in the School of Computing Science, Faculty of Applied Sciences

© Akash Abdu Jyothi 2018
SIMON FRASER UNIVERSITY
Summer 2018

Copyright in this work rests with the author. Please ensure that any reproduction or re-use is done in accordance with the relevant national copyright legislation.

Approval

Name: Akash Abdu Jyothi
Degree: Master of Science (Computing Science)
Title: Generating Natural Language Summaries for Image Sets
Examining Committee:
    Chair: Parmit Chilana, Assistant Professor
    Greg Mori, Senior Supervisor, Professor
    Anoop Sarkar, Supervisor, Professor
    Maxwell Libbrecht, Internal Examiner, Assistant Professor
Date Defended: May 31, 2018

Abstract

We address the problem of summarizing an image set with a natural language caption. We present PlacesCap, a new dataset for image set summarization. Our dataset consists of 11,661 image sets with a total of 116,113 images, where each set is summarized by a 3-sentence caption. We propose novel pooling operators for permutation-invariant sets of feature maps, and empirically evaluate image set summarization models based on these operators. We also conduct image set classification experiments and show competitive performance for the proposed set pooling operators.

Keywords: image set summarization; natural language summary generation; set compression; set pooling

Acknowledgements

First and foremost, I would like to thank my supervisor Dr. Greg Mori for envisioning this project, and for providing constant support and encouragement throughout its course. His guidance was critical for the completion of this work and immensely instructive for my learning.

I am deeply grateful to Fred for his involvement in this project from the very beginning. He was the go-to person whenever I was stuck with a problem, and his advice always cleared things up for me. I would like to thank Thibaut for his active participation in the project with many stimulating discussions, and for his experiments that helped us understand the dataset better. I would also like to thank Dr. Anoop Sarkar and Dr. Leonid Sigal for the many insightful discussions over this project.

I would like to thank Datapure for their prompt data annotation services, which helped us create our dataset without major delays.

I am grateful to my friends Nagender, Sreenath, Josh, Srikanth, Nishant, Pratik, Aniket, Sha, Nazanin and many others for keeping me grounded during the difficult process of growing up. To the authors of all the books that I read in the last two years: I can't name all of you here, but thanks a ton. For showing me the possibilities of being through wonderful music that I barely understand, I am forever indebted to Mr. Johann Sebastian Bach.

I would like to express my gratitude to my parents, who always supported me in my endeavours whether they saw any sense in them or not. Finally, I would like to thank my sister Sangeetha for her proactive support and invaluable guidance in life, the universe and everything.
Table of Contents

Approval
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
1 Introduction
    1.1 Contributions
2 Related Work
    2.1 Image set summarization as selecting exemplars
    2.2 Summarization of photo streams
    2.3 Visual captioning
    2.4 Visual Paragraph Generation
3 Dataset
    3.1 Image Set Collection
    3.2 Image Set Annotation
4 Models
    4.1 Set Summarization Model
    4.2 Pooling Techniques
        4.2.1 Set Aware Pooling
        4.2.2 Set Mask Pooling
        4.2.3 Set Weight Pooling
        4.2.4 Generalized Mean Pooling
5 Experiments and Results
    5.1 Experimental Setup
        5.1.1 Data Preprocessing
        5.1.2 Implementation Details
        5.1.3 Evaluation Framework
    5.2 Experiment on Classification
    5.3 Dataset Analysis
    5.4 Results
        5.4.1 Qualitative Results
    5.5 Analysis of Results
6 Conclusion
    6.1 Limitations and Future Work
Bibliography
Appendix A Labelling Guidelines for Annotators

List of Tables

Table 3.1: Distribution of categories in the PlacesCap dataset
Table 5.1: Analysis of predicted captions on the validation set, which has 2,332 image sets. The learning rate was set to 0.001. Values are reported for models from training epoch 18 (chosen based on the peak performance of most of the models)
Table 5.2: Classification accuracy obtained using mean pooling and the proposed pooling methods
Table 5.3: Summary of the results obtained using mean pooling and the proposed pooling methods. B-n = BLEU-n, Perp. = Perplexity, CE = Cross Entropy

List of Figures

Figure 1.1: A set of images showing a tourist location with a summary description below
Figure 2.1: Summary of a set of 2,000 images of the Vatican. Image from Simon et al. [22]
Figure 2.2: Figure from Kim et al. [12] showing (a) photo stream input from multiple users and (c) the output storyline graph with time stamps. The optional friendship-graph input (b) is utilized for creating weakly personalized storylines by giving higher weight to photo streams from close friends
Figure 2.3: Figure from Xu et al. [29] depicting an image captioning model with visual attention
Figure 2.4: Figure from Krause et al. [14] depicting a hierarchical recurrent neural network used for describing an image with a paragraph
Figure 3.1: Sample image sets, one from each category in the dataset, along with their annotated summary captions
Figure 4.1: Schematic of the proposed set summarization model
Figure 4.2: Schematic of the proposed set pooling operations
Figure 5.1: Results from the experiment where we use the top 1, 2, 3, 4, 5, and all images respectively from the image sets, for a summarization model that uses mean pooling of image features
Figure 5.2: Language metrics and cross entropy on the validation set for a mean-pooling-based model across training epochs, for beam sizes 1 (sampling), 3, 5, and 7. Higher values indicate better performance for CIDEr and BLEU-4
Figure 5.3: Summary prediction examples from the training set. GT = ground truth summary, Pred = predicted summary
Figure 5.4: Summary prediction examples from the validation set. GT = ground truth summary, Pred = predicted summary

Chapter 1

Introduction

Comprehending data collections on the internet is often difficult because of their sheer volume. These collections may consist of images (Flickr sets), audio (Spotify playlists), video (YouTube playlists), or any combination of these along with text (TripAdvisor, Airbnb, Amazon listings). Categorization and tagging of content are helpful, but they fall short of capturing the details that would enable fine-grained search. Fig. 1.1 shows a motivating example: an image set of a tourist location, with a summary caption that captures details that may be of interest to a potential tourist.

[Figure 1.1: A set of images showing a tourist location, with the summary description below: "This is a sandy beach. The beach is crowded with many tourists. There are rock formations and grassy patches of land nearby."]

Many professional and scientific disciplines also currently require humans to manually sift through large collections of images to obtain the required information. For example, a search and rescue effort may involve volunteers combing through satellite imagery for signs of a missing person, a humanitarian disaster relief project may require volunteers to assess the extent of damage across a large area so that resources can be appropriately allocated, or a medical diagnosis may require a practitioner to gather information from across many image scans.

We focus on the problem of producing meaningful textual output, in the form of natural language summaries, for large sets of images. Compact set summaries will help users navigate easily to interesting content in large data collections, and let them tell at a glance whether particular sets are relevant to their goals.

Automatic understanding of images is a common objective in computer vision. Most image understanding tasks, such as object detection, semantic segmentation, image captioning, and visual question answering, take a single image as input and infer task-specific information from it: for example, segmenting an image into its semantic categories, or answering a question about a single image. In contrast, image set summarization involves understanding an unstructured collection of images. We focus on summarization of image sets, though our model can easily be extended to summarize other input modalities by replacing the image feature extraction module appropriately.

Previous works on image set summarization [30, 24, 22, 23, 27] focus on selecting summary images from the set. This style of summarization is useful in certain cases, such as summarizing personal photo collections, but is of limited use in contexts where the image set may contain a large variety of objects and topics to be summarized. A text-based summary can overcome this limitation through abstractive summarization, creating a compact and comprehensive description. A text summary also has more general applicability, as discussed in the previous paragraphs.

To address this task, we present PlacesCap, a new dataset for image set summarization, and conduct experiments that demonstrate its utility for the set summarization task. PlacesCap contains 11,661 image sets, each consisting of the top (up to 10) Google image search results for a popular tourist location. We manually annotate each set with a 3-sentence summary that describes the major attractions of the tourist location. To our knowledge, this is the first dataset of image sets annotated with natural language summaries.
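To make the dataset format concrete, the snippet below sketches what a single PlacesCap record could look like. This is a hypothetical illustration: the field names, identifier, and file paths are assumptions, since the actual storage schema is not specified here. The summary text reuses the example caption from Fig. 1.1.

```python
# Hypothetical sketch of one PlacesCap record; the field names, id, and
# paths are illustrative assumptions, not the dataset's actual schema.
example_record = {
    "set_id": 421,                # hypothetical set identifier
    "location": "Example Beach",  # popular tourist location used as the search query
    "images": [                   # top Google image search results (up to 10 per set)
        "images/421/01.jpg",
        "images/421/02.jpg",
        # ... up to 10 image paths
    ],
    # One 3-sentence summary per set (here, the Fig. 1.1 example caption):
    "summary": ("This is a sandy beach. The beach is crowded with many tourists. "
                "There are rock formations and grassy patches of land nearby."),
}
```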
Combining sets of items is a more generic operation that goes beyond the task of summarization. Aggregating a set of permutation-invariant feature maps is typically achieved through pooling operations such as mean or max. These operations aggregate information from each input without considering the contents of the entire set. We propose parametrised set pooling operators that overcome this limitation. Our methods achieve set-aware pooling, and learn the parameters of the operator through backpropagation in an end-to-end fashion. In addition to the evaluation on the summarization task, we also test these new operators on an image set classification task and show a significant improvement in performance.
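To illustrate the distinction, the sketch below implements a minimal set-conditioned pooling layer in PyTorch: each element's weight is computed from the element's features concatenated with a set-level context vector (the set mean), so the pooled output depends on the contents of the whole set while remaining permutation invariant. This is an illustrative attention-style operator under assumed layer sizes, not the exact formulation of the operators defined in Chapter 4.

```python
# A minimal sketch of set-aware pooling, assuming PyTorch. The scoring MLP
# and its sizes are illustrative; the thesis's operators (Set Aware, Set Mask,
# Set Weight, Generalized Mean Pooling) are defined in Chapter 4 and may differ.
import torch
import torch.nn as nn


class SetConditionedPooling(nn.Module):
    """Pools a set of feature vectors into a single vector.

    Unlike plain mean/max pooling, each element's weight depends on the
    whole set: a score is computed from the element's features concatenated
    with a set-level context (the mean over the set), then normalized with
    a softmax. The layer is differentiable, so its parameters are learned
    end-to-end by backpropagation, and it is permutation invariant.
    """

    def __init__(self, feature_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, set_size, feature_dim)
        context = x.mean(dim=1, keepdim=True).expand_as(x)    # set-level context
        scores = self.score(torch.cat([x, context], dim=-1))  # (batch, set_size, 1)
        weights = torch.softmax(scores, dim=1)                # sums to 1 over the set
        return (weights * x).sum(dim=1)                       # (batch, feature_dim)


if __name__ == "__main__":
    pool = SetConditionedPooling(feature_dim=2048)
    feats = torch.randn(4, 10, 2048)  # e.g. CNN features for 10 images per set
    pooled = pool(feats)              # (4, 2048)
    # Shuffling the set elements leaves the output unchanged (up to float error).
    shuffled = feats[:, torch.randperm(10)]
    assert torch.allclose(pool(shuffled), pooled, atol=1e-5)
    print(pooled.shape)
```

Note that plain mean pooling corresponds to fixing every weight to 1/set_size; the learned scoring network is what makes the operator set-aware.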
1.1 Contributions

Our main contributions can be summarized as follows.

• We present the novel task of image set summarization using natural language captions, and conduct experiments on summary caption generation.
