Rochester Institute of Technology
RIT Scholar Works
Theses
5-20-2013

Using word and phrase abbreviation patterns to extract age from Twitter microtexts
Nathaniel Moseley

Follow this and additional works at: https://scholarworks.rit.edu/theses

Recommended Citation
Moseley, Nathaniel, "Using word and phrase abbreviation patterns to extract age from Twitter microtexts" (2013). Thesis. Rochester Institute of Technology. Accessed from

This Thesis is brought to you for free and open access by RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works.

Using Word and Phrase Abbreviation Patterns to Extract Age From Twitter Microtexts

Approved by Supervising Committee:
Dr. Manjeet Rege, Committee Chair, Department of Computer Science
Dr. Cecilia Ovesdotter Alm, Reader, Department of English
Dr. Reynold Bailey, Observer, Department of Computer Science

Using Word and Phrase Abbreviation Patterns to Extract Age From Twitter Microtexts
by
Nathaniel Moseley

THESIS
Presented to the Faculty of the B. Thomas Golisano College of Computing and Information Sciences
Department of Computer Science
Rochester Institute of Technology
in Partial Fulfillment of the Requirements for the Degree of
Master of Science

Rochester Institute of Technology
May 20, 2013

© Copyright 2012 by Nathaniel Moseley
All Rights Reserved

Acknowledgments

I wish to thank my friends and mother who provided valuable feedback on the reading of this document through its development, as well as reminding me to keep at the task of writing. I would also like to thank my parents for supporting me throughout my education.

I must extend gigantic thanks to my thesis committee, for their time spent proofreading this document and for invaluable input on the progress of my experimentation and ideas.
This would have been much more difficult without their help.

Abstract

The wealth of texts available publicly online for analysis is ever increasing. Much work in computational linguistics focuses on syntactic, contextual, morphological and phonetic analysis on written documents, vocal recordings, or texts on the internet. Twitter messages present a unique challenge for computational linguistic analysis due to their constrained size. The constraint of 140 characters often prompts users to abbreviate words and phrases. Additionally, as an informal writing medium, messages are not expected to adhere to grammatically or orthographically standard English. As such, Twitter messages are noisy and do not necessarily conform to standard writing conventions of linguistic corpora, often requiring special pre-processing before advanced analysis can be done.

In the area of computational linguistics, there is an interest in determining latent attributes of an author. Attributes such as author gender can be determined with some amount of success from many sources, using various methods, such as analysis of shallow linguistic patterns or topic. Author age is more difficult to determine, but previous research has been somewhat successful at classifying age as a binary (e.g. over or under 30), ternary, or even as a continuous variable using various techniques.

Twitter messages present a difficult problem for latent user attribute analysis, due to the pre-processing necessary for many computational linguistics analysis tasks. An added logistical challenge is that very few latent attributes are explicitly defined by users on Twitter. Twitter messages are a part of an enormous data set, but the data set must be independently annotated for latent writer attributes not defined through the Twitter API before any classification on such attributes can be done. The actual classification problem is another particular challenge due to restrictions on tweet length.
Previous work has shown that word and phrase abbreviation patterns used on Twitter can be indicative of some latent user attributes, such as geographic region or the Twitter client (iPhone, Android, Twitter website, etc.) used to make posts. This study explores whether there are age-related patterns, or change in those patterns over time, evident in Twitter posts from a variety of English language authors.

This work presents a growable data set annotated by Twitter users themselves for age and other useful attributes. The study also presents an extension of prior work on Twitter abbreviation patterns which shows that word and phrase abbreviation patterns can be used toward determining user age. Notable results include classification accuracy of up to 82.6%, which was 66.8% above the relative majority class baseline (ZeroR in Weka) when classifying user ages into 10 equally sized age bins using a support vector machine classifier and PCA extracted features.

Contents

Acknowledgments
Abstract
List of Tables
List of Figures
1 Introduction
2 Background
  2.1 Twitter and the Character of Tweets
3 Related Work
4 Hypothesis
5 Pre-Experimental Design and Implementation
  5.1 Data Collection
  5.2 Demographic Data Pre-Processing
  5.3 Tweet Loading
  5.4 Abbreviation Features and Extraction
6 Data Set Analysis
  6.1 Tweet Information
  6.2 Demographic Information
7 Experiments
  7.1 Initial Pilot Experiments
    7.1.1 Parameter Selection Experiments
    7.1.2 Grouping Experiments
    7.1.3 Binning Experiments
  7.2 Selected Data Set Experiments
    7.2.1 Boolean Feature Experiments
    7.2.2 Numeric Feature Experiments
    7.2.3 N-gram Feature Experiments
    7.2.4 Boolean and Numeric Feature Experiments
    7.2.5 Boolean and N-gram Feature Experiments
    7.2.6 Numeric and N-gram Feature Experiments
    7.2.7 Boolean, Numeric, and N-gram Feature Experiments
    7.2.8 Withheld Users Experiments
  7.3 Association Mining
  7.4 Longitudinal Analysis
8 Conclusions
9 Future Work
Appendices
Appendix A Demographic Tables
Vita

List of Tables

1 Feature type names, examples, and occurrence rates compared to the work of Gouws et al. [17]
2 Basic information about the present data set
3 Per-tweet minimum, maximum, and mean counts for types of tweet tokens
4 Tweet examples from the collected data set
5 Feature names and the difference of relative percentages found in the collected data set and the work of Gouws et al. [17]
6 N-grams generated and their frequencies.
7 Age values covered by equal size classification bins
8 Accuracy values for boolean feature experiments
9 Accuracy values for numeric feature experiments
10 Accuracy values for n-gram feature experiments
11 Accuracy values for boolean and numeric feature experiments
12 Accuracy values for boolean and n-gram feature experiments
13 Accuracy values for numeric and n-gram feature experiments
14 Accuracy values for boolean, numeric, and n-gram feature experiments
15 Results of withheld user testing, 100 tweets per instance.
16 Results of withheld user testing, 75 tweets per instance.
17 Some association rules found in analysis.
18 Class association rules found in analysis.

List of Figures

1 The web form prepared in this study for Twitter users to submit their information
2 The web form suggests options as the user types
3 The web form highlights input errors and displays an associated message
4 Description of the nine abbreviation features from Gouws et al. Each word pair was assigned one feature type classification. Some feature types overlap, such as drop last character and word begin. In these cases, the more specific classification was assigned (drop last character).
5 Text cleanser algorithm provided by Gouws et al. [17]
6 Abbreviation feature assignment algorithm
7 Tweet and token distributions
8 User demographic data specified for the data set
9 Reported birth years of participants and age ranges at time of publication
10 Metrics used in evaluation of classifiers.
11 Plots of You to U feature use percentages over time.
12 Best achieved accuracy for each feature type and classifier.