
SENTIMENT MINING OF ARABIC TWITTER DATA by Soha Galalaldin Khider Ahmed (A Thesis)



SENTIMENT MINING OF ARABIC TWITTER DATA

by
Soha Galalaldin Khider Ahmed

A Thesis Presented to the Faculty of the American University of Sharjah College of Engineering in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Engineering

Sharjah, United Arab Emirates
January 2014

© 2014 Soha Galalaldin Khider Ahmed. All rights reserved.

Approval Signatures

We, the undersigned, approve the Master’s Thesis of Soha Galalaldin Khider Ahmed.

Thesis Title: Sentiment Mining of Arabic Twitter Data

Dr. Michel Pasquier, Associate Professor, Department of Computer Science and Engineering (Thesis Advisor)
Dr. Ghassan Qaddah, Associate Professor, Department of Computer Science and Engineering (Thesis Co-Advisor)
Dr. Khaled El Fakih, Associate Professor, Department of Computer Science and Engineering (Thesis Committee Member)
Dr. Ashraf Elnagar, Professor, Department of Computer Science, University of Sharjah (Thesis Committee Member)
Dr. Assim Sagahyroon, Head, Department of Computer Science and Engineering
Dr. Hany El-Kadi, Associate Dean, College of Engineering
Dr. Leland Thomas Blank, Dean, College of Engineering
Dr. Khaled Assaleh, Director of Graduate Studies

Acknowledgments

Foremost, I would like to thank Allah for giving me health, patience and strength to write this thesis and for all the graces He has granted me. I would like to thank my supervisors Dr. Michel Pasquier and Dr. Ghassan Qaddah for advising me during my graduate study years. Thank you very much for your patience, guidance and encouragement.
I learnt from you how to be a real researcher and how to think differently. I would also like to thank all my professors and friends at AUS, especially my friend Kouloud Safi.

Finally, I dedicate this thesis to my family, who have always supported me in my studies and life. Without your love, care and patience, I would not have achieved this. Special thanks go to my father for believing in me and supporting me in my life and education. My father believed that education was the ticket to success. He particularly wanted me to be a professor, and I will try hard to make his dream and mine come true. I am also grateful to my brothers, Mohamed, Yousif, Khider and Omer, my dear sister Line, and my fiancé Osman for their help and continuous support.

A special dedication of this thesis goes to the most beloved person: Mum. Thank you for your patience, care and everything you have done to keep our family gathered in peace and happiness. Thank you for giving us the love we need to survive in this life. I will always love you, Mum.

To my beloved parents, brothers and sisters for their endless love and support

Abstract

Social networking services such as Facebook and Twitter and social media hosting websites such as Flickr and YouTube have become increasingly popular in recent years. One key factor in their worldwide appeal is that these sites and services allow people to express and share their opinions, likes, and dislikes freely and openly. The opinions posted range from criticizing politicians to discussing football matches, citing top news, appraising movies, and recommending new products and services such as mobiles, restaurants, and software. This development has fueled a new field known as sentiment analysis and opinion mining, with the goal of extracting people’s sentiment from text to assist customers in their purchase decisions and vendors in enhancing their reputation.
This emerging field has attracted considerable research interest, but most of the existing work focuses on English text. Hence, in this thesis, we studied sentiment analysis of Arabic text retrieved from a well-known social media site, namely Twitter. Specifically, we studied target-dependent sentiment analysis of Arabic Twitter text, a topic that has not been addressed for the Arabic language before. We developed a system that acquires Arabic text from Twitter and extracts users’ opinions towards different topics and products. The key phases of the system are as follows. In the Data Acquisition phase, we collected tweets from Twitter related to specific topics. In the Tweet-Filtering phase, we reduced the noise in the collected tweet data to facilitate the Annotation phase, in which we annotated the collected tweets with respect to the specified topic. In the Data Preprocessing phase, we added tags, normalized the words used in tweets, and removed spam tweets. In the Feature Identification phase, we extracted stylistic, syntactic, and semantic features, and selected those yielding better results using feature selection algorithms. In the Classification phase, a trained machine-learning algorithm decided whether to label each tweet as negative, positive, or neutral towards a specific topic. Results from different feature sets, classifiers, and datasets are reported in terms of classification accuracy, Kappa statistic, and F-measure.

Search Terms: Arabic Social Media, Machine Learning, Semantic Features, Sentiment Analysis, Stylistic Features, Syntactic Features, Text Preprocessing.

Table of Contents

Approval Signatures
Acknowledgments
Abstract
List of Figures
List of Tables
1. Introduction
  1.1 Background
  1.2 Problem Statement
2. Literature Review
  2.1 Basic Taxonomy
  2.2 Sentiment Classification Tasks
    2.2.1 Data preprocessing.
    2.2.2 Class labeling.
    2.2.3 Annotation granularity.
    2.2.4 Source and target identification.
  2.3 Sentiment Classification Features
    2.3.1 Syntactic features.
    2.3.2 Semantic features.
    2.3.3 Stylistic features.
  2.4 Sentiment Classification Techniques
    2.4.1 Machine learning.
    2.4.2 Link analysis.
    2.4.3 Score-based methods.
  2.5 Sentiment Classification Input Domain
    2.5.1 Reviews.
    2.5.2 Web discourse.
    2.5.3 News articles.
    2.5.4 Social media websites.
  2.6 Arabic Language Sentiment Classification
  2.7 Challenges and Issues
    2.7.1 Difficulties inherent to sentiment analysis.
    2.7.2 Difficulties related to the Arabic language.
    2.7.3 Difficulties pertaining to social media text.
  2.8 Discussion and Direction
3. System Description
  3.1 ASSA Model Generation
    3.1.1 Data acquisition phase.
    3.1.2 Tweet-filtering phase.
    3.1.3 Annotation phase.
    3.1.4 Data preprocessing phase.
    3.1.5 Feature identification phase.
    3.1.6 Classification phase.
  3.2 ASSA System Deployment
    3.2.1 Data acquisition phase.
    3.2.2 Data preprocessing phase.
    3.2.3 Feature identification phase.
    3.2.4 Classification phase.
4 Data Acquisition and Annotation Phases
  4.1 Data Acquisition Phase
    4.1.1 Twitter fetcher module.
  4.2 Tweet-Filtering Phase
  4.3 Annotation Phase
5 Data Preprocessing Phase
  5.1 Data Cleaning
    5.1.1 Tag adder module.
    5.1.2 Normalization module.
    5.1.3 Spam detection module.
  5.2 Natural Language Processing Tasks
    5.2.1 Stemming.
    5.2.2 Part-of-speech tagging.
6 Feature Identification
  6.1 Features extraction
    6.1.1 Syntactic features.
      Segmentation.
      Punctuation.
    6.1.2 Semantic features.
    6.1.3 Stylistic features.
      Word frequency.
      Other stylistic attributes
  6.2 Non-lexical Features Extraction Experiments
    6.2.1 First experiment.
    6.2.2 Second experiment.
    6.2.3 Third experiment.
    6.2.4 Fourth experiment.
    6.2.5 Fifth experiment.
    6.2.6 Sixth experiment.
  6.3 Features selection
    6.3.1 N-gram selection.
    6.3.2 Word frequency selection.
    6.3.3 Lexical and non-lexical features selection.
    6.3.4 Three-way classification features selection.
    6.3.5 Subjectivity classification features selection.
    6.3.6 Polarity classification features selection.
7 Classification Phase
  7.1 Three-way Classification Accuracy Using Blind Testing
  7.2 Subjectivity Classification Accuracy Using Blind Testing
  7.3 Polarity Classification Accuracy Using Blind Testing
8 System Deployment
  8.1 Data Acquisition Phase
  8.2 Data Preprocessing Phase
  8.3 Feature Identification Phase
  8.4 Classification Phase
9 Conclusion
  9.1 Achievements
  9.2 Limitations
  9.3 Recommendations and Future Work
10 References
Appendix
  Twitter API
  YouTube API
  Regular Expressions
  Sample of the Arabic Polarity Lexicon
  Sample of Named Entity Dictionary
11 Vita
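
The abstract states that results are reported as classification accuracy, Kappa statistic, and F-measure, but this preview does not show how those figures were computed. As a rough illustration only, the sketch below applies the standard textbook definitions (Cohen's kappa for agreement beyond chance, and per-class F-measure as the harmonic mean of precision and recall) to a three-way negative/neutral/positive labeling. The function name `evaluate` and the sample labels are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def evaluate(gold, predicted):
    """Return (accuracy, Cohen's kappa, per-class F-measure) for two label lists.

    Hypothetical helper for illustration; not the thesis's own tooling.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")

    n = len(gold)
    labels = sorted(set(gold) | set(predicted))
    pairs = Counter(zip(gold, predicted))   # (gold, predicted) -> count
    gold_totals = Counter(gold)             # how often each class is the true label
    pred_totals = Counter(predicted)        # how often each class is predicted

    # Accuracy: share of items where prediction and gold label agree.
    accuracy = sum(pairs[(c, c)] for c in labels) / n

    # Kappa: observed agreement corrected for agreement expected by chance.
    expected = sum(gold_totals[c] * pred_totals[c] for c in labels) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0

    # F-measure per class: harmonic mean of precision and recall.
    f_measure = {}
    for c in labels:
        tp = pairs[(c, c)]
        precision = tp / pred_totals[c] if pred_totals[c] else 0.0
        recall = tp / gold_totals[c] if gold_totals[c] else 0.0
        total = precision + recall
        f_measure[c] = 2 * precision * recall / total if total else 0.0

    return accuracy, kappa, f_measure


# Made-up labels for six tweets about one topic (illustrative data only).
gold = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
pred = ["positive", "negative", "negative", "negative", "neutral", "positive"]

accuracy, kappa, f_measure = evaluate(gold, pred)
print(f"accuracy={accuracy:.3f} kappa={kappa:.3f}")
for label, score in f_measure.items():
    print(f"F({label})={score:.3f}")
```

Kappa is worth reporting next to accuracy because a classifier that always predicts the majority class can reach high accuracy on an imbalanced tweet set while its kappa stays near zero.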
