Saarland University
Department of Language Science and Technology
Computational Linguistics and Phonetics

Bachelor thesis

Sentiment and Emotion Movie Script Annotation

Tatiana Anikina
2547420
[email protected]

submitted 14.06.2017

Supervisor: Prof. Dr. Dietrich Klakow
Advisors: Dr. Jannik Strötgen, Dr. Paramita Paramita
Reviewers: Prof. Dr. Dietrich Klakow, Prof. Dr. Gerhard Weikum

Statement

Hereby I confirm that this thesis is my own work and that I have documented all sources used.

Saarbrücken, 14.06.2017
Tatiana Anikina

Declaration of consent

Herewith I agree that my thesis will be made available through the library of the Department of Language Science and Technology, Computational Linguistics and Phonetics.

Saarbrücken, 14.06.2017
Tatiana Anikina

Contents

Contents
List of Figures
1 Introduction
  1.1 Motivation
  1.2 Goals of the Thesis
  1.3 Structure of the Thesis
2 Basic Concepts and Related Work
  2.1 Subjectivity
  2.2 Sentiment
  2.3 Emotions
3 Movie Scripts
  3.1 Movie Script Structure
  3.2 Movie Script Data
  3.3 Movie Script Sentiment and Emotions
4 Methodology
  4.1 Subjectivity Detection
  4.2 Sentiment Detection
  4.3 Emotion Detection
5 Evaluation
  5.1 Gold Standard Data
  5.2 Subjectivity Detection
  5.3 Sentiment Detection
  5.4 Emotion Detection
6 Discussion and Future Work
Bibliography
Appendices
  Appendix I: General Annotation Guidelines
  Appendix II: Processing Scripts
  Appendix III: Annotations for Movie Script Chinatown (1974)

List of Figures

3.1 Movie Script Model
3.2 Movie Genre Distribution
4.1 Subjectivity Detection Pipeline
4.2 Sentiment Detection Pipeline
4.3 Emotion Detection Pipeline

Chapter 1. Introduction

1.1 Motivation

Sentiment analysis and emotion classification play an important role in many natural language processing applications. They help to detect the attitude of the speaker or writer and give useful insights for social media analysis. This work focuses on the movie domain, where emotions and sentiment are crucial.

Analyzing movie scripts and annotating them with sentiment and emotions could be useful for a wide range of applications. Movie recommendation systems and automatic search for emotional scenes are just two of them. Instead of searching for movies by genre, a recommendation system with emotion annotations could offer similar movies and scenes based on their emotion and sentiment patterns, even if the genre is different.

Another scenario for a search engine is the following: a person wants to find a movie and does not remember the title but has a particular scene in mind. Because most people tend to remember emotionally intense events, there is a high probability that this person remembers one of the emotional scenes. A collection of movie scripts annotated with emotional scenes could be very useful in this case.
Given the emotions and some extra information (e.g. the number of characters involved, the time of day and the location), only the relevant scenes from the movies can be looked up and presented to the user.

Movie scripts include different types of text. On the one hand, they contain narrative descriptions of the events happening in the movie, which are mostly written in formal language. On the other hand, they contain dialogues, which may use informal spoken language. This means that different approaches to sentiment and emotion detection could be useful for different parts of a movie script. In this work, dialogues and descriptive texts are treated as two different kinds of text.

1.2 Goals of the Thesis

Early work on sentiment analysis mainly relied on rule-based systems with polarity lexicons. In recent years, however, a lot of research has been done on sentiment classification using machine learning techniques. The common approach, called supervised learning, is to collect annotated data to use as a training set and to apply the model learned from these data to new (previously unseen) instances. This method is supposed to be less rigid than the rule-based approach and may achieve high precision and recall if trained on high-quality annotated data from the same or, at least, a similar domain. However, annotated resources do not exist for many domains, and collecting them takes time and requires expert knowledge. The movie domain does not have data annotated with sentiment and emotion labels that could be used directly for training.

The major goal of the current work was to implement a system that tries different classifiers and training data on descriptive (meta) texts and dialogues from movie scripts. Sentiment analysis was performed with and without prior subjectivity detection.
Subjective sentences usually convey polarity, so prior subjectivity detection restricts the input of the sentiment classifier to positive and negative instances and discards the neutral ones. This turns sentiment classification into a binary decision task, which is usually more robust and easier to implement than the multi-class approach.

Subjectivity detection was performed using two methods: Support Vector Machines (SVM) and fastText (a linear classification model with rank constraints). Both classifiers were trained on annotated data from the blog and news domains. The results of the subjectivity detection were combined with the sentiment annotation produced by the CoreNLP and fastText classifiers, which were trained on the news and movie review domains.

Emotion detection was performed using a rule-based approach, a multi-class SVM classifier, fastText and OpenNLP. All of them except the rule-based system were trained on Twitter messages, news headlines and fairy tales annotated with emotion labels.

1.3 Structure of the Thesis

This work is divided into six chapters. After this short introduction to the research topic, chapter 2 discusses some basic concepts and related work. It also gives a short overview of various approaches to subjectivity, sentiment and emotion detection, together with a list of commonly used resources and data sets.

Chapter 3 describes the movie script domain and provides some statistics for the data used in the current project. It also introduces the object model implemented for the movie scripts.

Chapter 4 describes the core part of the project and gives some implementation details for the subjectivity, sentiment and emotion detection pipelines.

Chapter 5 presents the final results and the evaluation of classifier performance on different data with various configurations.

The last part of the thesis (chapter 6) includes some ideas regarding future work and possible extensions of the current pipeline.
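The two-stage setup from section 1.2 (subjectivity detection first, then binary sentiment classification on the subjective sentences only) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the word lists, the function names and the keyword-based decisions are toy stand-ins for the trained SVM, fastText and CoreNLP models and do not reproduce the system implemented in this thesis.

```python
from typing import Callable, List, Tuple

# Hypothetical toy lexicons; the thesis uses trained classifiers instead.
POSITIVE = {"love", "wonderful", "happy", "great"}
NEGATIVE = {"hate", "terrible", "afraid", "awful"}


def tokenize(sentence: str) -> List[str]:
    """Lowercase whitespace tokens with surrounding punctuation removed."""
    return [tok.strip(".,!?;:\"'").lower() for tok in sentence.split()]


def is_subjective(sentence: str) -> bool:
    """Stage 1 stand-in: a sentence counts as subjective if it has a polar word."""
    return any(t in POSITIVE or t in NEGATIVE for t in tokenize(sentence))


def binary_sentiment(sentence: str) -> str:
    """Stage 2 stand-in: binary decision between positive and negative."""
    tokens = tokenize(sentence)
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return "positive" if pos >= neg else "negative"


def annotate(
    sentences: List[str],
    subjectivity: Callable[[str], bool] = is_subjective,
    sentiment: Callable[[str], str] = binary_sentiment,
) -> List[Tuple[str, str]]:
    """Label objective sentences as neutral; pass the rest to the binary classifier."""
    annotated = []
    for sentence in sentences:
        if subjectivity(sentence):
            annotated.append((sentence, sentiment(sentence)))
        else:
            annotated.append((sentence, "neutral"))
    return annotated


if __name__ == "__main__":
    script_lines = ["I love this place!", "He opens the door.", "I hate waiting."]
    for text, label in annotate(script_lines):
        print(label, "|", text)
```

Because both stages are passed in as plain functions, the same `annotate` loop could in principle be run with a different subjectivity or sentiment model for dialogues than for descriptive text, which matches the idea of treating the two kinds of text separately.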