ebook img

Predicting Demographics of High-Resolution Geographies with Geotagged Tweets PDF

0.43 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Predicting Demographics of High-Resolution Geographies with Geotagged Tweets

Predicting Demographics of High-Resolution Geographies with Geotagged Tweets OmarMontasser and DanielKifer SchoolofElectricalEngineeringandComputerScience PennsylvaniaStateUniversity UniversityPark,PA16801 [email protected],[email protected] 7 Abstract ple of people prepared by the U.S. Census Bureau, con- 1 tains1-year,3-year,and5-yearmovingaverageestimatesof 0 In this paper, we consider the problem of predicting demo- demographics but with lower resolutions (county-level and 2 graphicsofgeographicunitsgivengeotaggedTweetsthatare blockgroup-level) and higher error. Another important is- composedwithintheseunits.Traditionalsurveymethodsthat n sue isthat the administrativeboundaries change every cen- offerdemographicsestimatesareusuallylimitedintermsof a sus,forexample,the2000block-levelboundariesarediffer- J geographic resolution, geographic boundaries, and time in- tervals. Thus,itwouldbehighlyusefultodevelopcomputa- entfromthe2010block-levelboundaries. Thisraiseschal- 2 tionalmethodsthatcancomplementtraditionalsurveymeth- lenges for researchers who study populations over decades 2 ods by offering demographics estimates at finer geographic becausetheyneedtocrosswalkfromonecensustothenext. resolutions,withflexiblegeographicboundaries(i.e.notcon- So, to compare 2000 demographics with 2010 demograph- ] G finedtoadministrativeboundaries),andatdifferenttimein- ics,theyneedtofigureouthowwouldthe2000demograph- tervals. While prior work has focused on predicting demo- icslooklikeusingboundarydefinitionsfrom2010. L graphicsandhealthstatisticsatrelativelycoarsegeographic . resolutionssuchasthecounty-levelorstate-level, weintro- Using geotagged Tweets to predict demographics of ge- s c duceanapproachtopredictdemographicsatfinergeographic ographic units can be a complementary method to survey [ resolutionssuchastheblockgroup-level. Forthetaskofpre- collection methods as long as their bias can be corrected. dicting gender and race/ethnicity counts at the blockgroup- Thiswouldbeacheaperandfasterapproachtoestimatede- 1 level, an approach adapted from prior work to our problem mographics data at resolutions finer than what traditional v achievesanaveragecorrelationof0.389(gender)and0.569 sources offer. In addition, this approach is not confined to 5 (race)onaheld-outtestdataset. Ourapproachoutperforms administrativegeographicboundariesbutcanbeadaptedto 2 thispriorapproachwithanaveragecorrelationof0.671(gen- 2 der)and0.692(race). customgeographicboundaries. Also, itismoreflexibleby 6 permittingcollectingdemographicsdataatdifferenttimein- 0 tervals(e.g. 6months). . 1 Introduction 1 We present a method to predict demographics of high- 0 Social media data has become increasingly important with resolution geographic units using geotagged Tweets. The 7 applications to many fields such as health (Culotta 2014a) main idea is to learn to predict demographics of a region 1 and sociolinguistics (Eisenstein et al. 2014). Furthermore, basedoncharacteristicsofTweetsinthatregion. Animpor- : v social media has been used to complement traditional sur- tant aspect of our approach is that we do not require label- i veymethodsasafasterandcheaperapproachtocollectin- ingofindividualTweets. Toevaluateourmethod, wetrain X formationandmakepredictions(Bentonetal.2016). models to predict gender and race/ethnicity demographics r Collectingdemographicsdataisusuallyalongandcostly of Census predefined geographies at different resolutions a process which limits the rate and resolution at which this (block, blockgroup, tract, and county) using 2010 Census collection may be performed. The U.S. Census Bureau demographics data as ground truth. At the block-level, we releases a population census every 10 years with demo- achievean averagecorrelation of0.585 (gender)and 0.487 graphicsdataonmultiplegeographicresolutions(U.S.Cen- (race)onaheld-outtestdataset. Wefindthatourapproach susBureau2010).Thesegeographicresolutionsfollowahi- significantly improves upon the results of a competing ap- erarchy:eachstateisdividedintocounties,theneachcounty proachadaptedfrompriorwork. Wealsofindthatfor95% is divided into tracts, then each tract is divided into block ofblocks,blockgroups,tracts,andcountieswithatleast100 groups, and each block group is divided into blocks. The Twitter users, the relative prediction error is at most 1.98, census contains demographics data at the block-level and 1.15,0.90,and0.78respectively. movinguptothestate-level. Also,TheAmericanCommu- WediscussrelatedworkinSection2,weintroduceneces- nity Survey (ACS), which is a statistical survey of a sam- sarydefinitions,notations,andformallydefinetheproblem Copyright(cid:13)c 2017,AssociationfortheAdvancementofArtificial inSection3,wedescribeourapproachinSection4,andwe Intelligence(www.aaai.org).Allrightsreserved. presentexperimentsinSection5. 2 RelatedWork Notation Definition W thesetoftweetedwordsthatappeareding . The availability of large-scale geotagged Twitter data has i i c #oftimeswappeareding . spurred a lot of interest in predicting demographics of ge- w,i i C #oftotalwordsthatappeareding . ographicunits. Itisviewedasacheaperandfastermethod i i to draw inferences that can complement traditional survey uw,i #ofTwitteruserswhousedwingi. methods. There are two strategies in general. First is pre- Ui #ofTwitteruserswhotweetedatleast dicting demographics of geographic units directly. Second oncefromgi. is predicting demographics of Twitter users in those geo- xi Featurevectorofgi. graphic units individually and then performing some form xi,w Valueoffeaturewingi. ofaggregation. D Dimensionofthefeaturespace. Prior work that employs the first strategy spans several applications. Eisenstein, Smith, and Xing (2011) predicted Table1: Mathematicalnotation demographics of Zip Code Tabulation Areas (ZCTA) us- ing geotagged Tweets and Census data. Similarly, Mo- (2014b) weighted the contributions of Twitter users based hammady and Culotta (2014) predicted race composition on their demographics. Also, Almeida and Pappa (2015) of the 100 most populous counties from geotagged Tweets sampled a subset of Twitter users in an area according to and then used that for individual-level labeling of Twitter thedistributionoftherealpopulationinthatarea. Bothap- users. Thereisalsopriorworkthatfocusedonhealthstatis- proachesrequireindividual-levellabellingofTwitterusers. tics. Schwartz et al. (2013b) predicted life-satisfaction of Ourmethod howeverdirectly trains tofix thebias. For ex- counties using county demographics and Tweets. Culotta ample, we know the characteristics of Twitter users in an (2014a) predicted several health statistics (such as obesity area(biased)andweknowthepopulationcharacteristicsof rates) of the 100 most populous counties. Eichstaedt et anarea,andwelearnafunction(model)thatmapsoneinto al. (2015) predicted atherosclerotic heart disease mortality theother. rates of counties from Tweets. Ireland et al. (2015) pre- dicted HIV rates of counties from Tweets. Loff, Reis, and 3 Preliminaries Martins(2015)predictedwell-beingofstatesbyestimating In this section, we setup some necessary definitions, nota- theGallup-Healthwaysindex. Bentonetal.(2016)usedsu- tions, and formally define the problem. Our dataset is a pervisedtopicmodelstopredictresponsestomiscellaneous set of Tweets {t ,t ,...,t }. Each Tweet t is a tuple of surveyquestionssuchaspercentageofsmokersatthestate- 1 2 m i the form (loc ,uid ,< wi,wi,wi,... >) where loc is the level. i i 1 2 3 GPSlocation,uidistheuserid,and< w ,w ,w ,... >is Asdescribed,thesepriorapproachesmadepredictionsat 1 2 3 thesequenceoftokensintheTweet. Wehaveasetofgeo- relativelycoarsegeographicresolutionssuchasthecounty- graphicunits{g ,g ,...,g }. Eachg isatupleoftheform level which may be due to limitations in ground truth data 1 2 n i (shape ,y )whereshapeistheboundarydefinitionandy availability at finer resolutions (e.g. health statistics). Our i i is ground truth count data of a demographic variable (e.g. approachpredictsdemographicsatfinergeographicresolu- gender) of the geographic unit. If we have a demographic tionssuchastheblock-level. Furthermore,weexplorehow variable with k mutually exclusive categories, then y is a largeapopulationneedstobe(i.e.resolution)inordertoget k dimensional vector where y is the count of category j, accuratepredictions. j j ∈[1,k](e.g. forgender,y isthecountofmalesandy is Related to the second strategy, there is a rich body of 1 2 thecountoffemales).ForeachTweett ,wemapittoageo- work on individual-level prediction tasks focusing on at- i graphicunitg ifshape containsloc . Thus,wegroupthe tributes of social media users. These attributes include: j j i Tweetsintodisjointbagsthatcorrespondtothegeographic gender (Burger et al. 2011; Bergsma et al. 2013; Bam- units: man, Eisenstein, and Schnoebelen 2014; Vicente, Batista, and Carvalho 2015), age (Schwartz et al. 2013a; Moseley, ({tj1}Nj=11,y1),({tj2}Nj=21,y2),...,({tjn}Nj=n1,yn) Alm, and Rege 2014), ethnicity (Chang et al. 2010; Chen where geographic unit i has a set of observed Tweets et al. 2015), income and socio-economic status (Preo¸tiuc- {tj}Ni withatotalofN Tweetsandademographicvari- i j=1 i Pietroetal.2015;Lamposetal.2016). Inaddition,several able vector y . Using such a dataset, our goal is to learn a i workspredictedawide-rangeofattributesincludingdemo- model f such that for an unseen geographic unit g with a graphics and emotions (Culotta, Kumar, and Cutler 2015; set of Tweets {tj}Ng , f({tj}Ng ) = y , where y is an Volkova et al. 2015). Adapting these methods to solve our g j=1 g j=1 (cid:99)g (cid:99)g estimateofthetrueunknowndemographicscounty . problem poses several challenges. For example, it requires g Mathematicalnotationisdefinedintable1. supervisedtrainingandlabeleddemographicsforlargenum- bers of Twitter users. These labels have to be collected 4 Method yearlytoaccountforconceptdriftassociatedwithchanging Inthissection,wedescribeourapproachtolearnthemodel generations. Also, once Twitter users are labeled, it is not f. Ourapproachreliesoncomputingafeaturevectorx ∈ immediatelyobvioushowtocombinetheseobservationsto i arrive at the demographics of geographic units since Twit- RD given {tj}Ni , for each geographic unit i ∈ [1,n]. i j=1 ter users with geotagged tweets are a highly biased sample Then,wefitamodeltopredicty givenx . Below,wedis- (cid:98)i i (Malik et al. 2015). To reduce the sampling bias, Culotta cusspossiblefeatureengineeringandmodelingchoices. 4.1 FeatureEngineering graphicunits,wenormalizec bydividingbythenumber w,i of total tweeted words in region g , C ; and u by divid- Inaccordancewithpriorwork,wefocusonfeaturesthatare i i w,i ingbythenumberofTwitterusersthattweetedatleastonce basedonlexicalcontent. Thisismotivatedbyexploringthe inregiong , U . Normalizationshelpdifferentiatebetween most predictive linguistic patterns of demographics. First i i geographic units as they take into account the size of total wediscusspossiblelexicalfeatures, thenwediscusspossi- observations. ble normalization and transformation schemes that can be Usingthesemutuallyexclusiveschemes,wecomputefea- appliedtothesefeatures. turevaluesv foreachgeographicunitg andeachwordw i,w i Features There are several possible lexical features that inthesetoftweetedwordsingi,Wi(forw ∈/ Wi,vi,w =0): canbeusedtorepresentgeographicunits. Theseincludebut • RawWord: v =c . i,w w,i arenotlimitedto: lexicons,latenttopics,wordsandphrases • NormalizedWord: v = cw,i. (bag-of-words),andembeddings.Thesefeaturesarenotmu- i,w Ci tuallyexclusiveandcanbecombinedtogether. • RawUser: v =u . i,w w,i Lexicons are predefined word-to-category mappings that • NormalizedUser: v = uw,i can be used to represent each geographic unit by the fre- i,w Ui In our experiments our feature set includes only words quency of each category (Schwartz et al. 2013b; Culotta that appear in the training split. This is to ensure that we 2014a).Lexiconsusuallyhavestrongerdomainassumptions accountfortheeffectofout-of-vocabularywords. (comparedtobag-of-words)(OConnor,Bamman,andSmith 2011)andarelimitedtospecificapplicationssuchashealth Transformations After computing a bag-of-words repre- andpersonality(Culotta2014a). So, wedonotexploreus- sentation using Raw Word, Raw User, Normalized Word, inglexiconsasfeatures. orNormalizedUser,weperformfeaturetransformationson Each geographic unit can also be represented by its dis- the representation. We explore different transformations: tributionoverasetoflatenttopics,mostcommonlylearned Term Frequency-Inverse Document Frequency (TFIDF), usinglatentDirichletallocation(Blei,Ng,andJordan2003). Anscombe, Logistic, and Gaussian. Not every transforma- Schwartz et al. (2013b) report that this is better than using tionisappliedtoeveryrepresentation,forexample,TFIDF lexicons for their task of predicting well-being. Benton et isappliedtoRawWordandRawUser, andtherestareap- al.(2016)exploredvariantsoftopicmodelsthatareguided pliedtoNormalizedWordandNormalizedUser. by supervision to generate feature representations. Inter- If we have a feature vector v for a geographic unit g i i estingly, theyfoundthatthebag-of-wordsrepresentationis (computed using Raw Word, Raw User, Normalized Word, competitivewiththebestsupervisedtopicmodels. or Normalized User), we transform it to x by applying an i We explored using embeddings as features by learning element-wise transformation on each v . We apply the i,w representations of geographic units using Paragraph Vec- transformation only on v (cid:54)= 0 to preserve the sparsity i,w tor (Le and Mikolov 2014). This is a similar technique to ofourfeaturevectors. Word2Vec(MikolovandDean2013),butratherthanlearn- TFIDF:Theworddistributionacrossallgeographicunits ing representations of words, it learns representations of has a long-tail shape, with few words appearing in all geo- paragraphsordocuments. Wemodeleachg asadocument graphicunitsandless-frequentwordsappearinginfewgeo- i consisting of sentences which are the Tweets {tj}Ni . We graphicunits.ThemotivationbehindusingTFIDFistohelp i j=1 foundthatbag-of-wordsiscompetitivewiththisapproach. the model take that into account, and re-weight the word Bag-of-words uses words and/or phrases as features in- counts inversely. Each geographic unit represents a docu- steadofusingcategoriesortopics. Weusethisrepresenta- ment, we learn the inverse document frequency in the fol- tion because it is simpler and has weaker domain assump- lowingmanner: tions(OConnor,Bamman,andSmith2011). n idf(w)=log 1+|{i∈[1,n]:w ∈W }| Normalizations We discuss different ways of counting i and normalizing occurrences of words. Our discussion is WeusethistransformationwithRawWordorRawUser, basedonusingabag-of-wordsrepresentationbutthesetech- sov =c orv =u . Then,x =v (idf(w)+ i,w w,i i,w w,i i,w i,w niques can also be applied to lexicons. We start with com- 1).Notethatweadda1toidf(w)becausewedonotwantto putingrawcountsoftweetedwords,c andu ,foreach completelyignorewordsthatappearinallgeographicunits. w,i w,i wordwandregiong wherei∈[1,n]. c isthenumberof Inourexperiments,welearnidf(·)basedonlyonthetrain- i w,i timesawordwistweetedinregiong . Itisoblivioustothe ingsplit. i numberofTwitterusersthatusedw.So,itispossibleforthe Anscombe: We applied this transformation to stabilize featurevectortobeskewedbyTwitterusersthatuseaword thevarianceofwordfrequencies. Thedistributionofaword wmanytimes,eitherinoneormanyTweets. Toaccountfor mayberight-skewed(i.e. appearsalotinafewgeographic that, u counts the number of distinct Twitter users that unitsandappearslittleelsewhere),theAnscombetransform w,i usedwing .SincethedistributionofgeotaggedTweetsand helpsadjustthisskewnessandmakethedistributionroughly i geotagTwitterusersisnotuniformacrossregions(Maliket symmetric. Ithelpsturnarandomvariabledistributiontobe al.2015),usingrawcountssuchasc andu asfeature more Gaussian (Anscombe 1948). Schwartz et al. (2013a) w,i w,i values will result in highly imbalanced feature vectors for appliedthistransformationinpredictingselectdemograph- geographicunits.Tobalancethesedifferencesbetweengeo- ics of Facebook users. We use this transformation with Normalized Word or Normalized User, so v = cw,i or 5 Experiments (cid:113) i,w Ci v = uw,i. Then,x =2 v + 3. Weevaluateourapproachandcompetingapproaches(base- i,w Ui i,w i,w 8 lines) on both variants of the problem using Census pre- Logistic and Gaussian: Inspired by the use of activa- defined geographies at different resolutions: block, block tion functions in neural networks, we explored applying a group,tract,andcounty. Amongthefour,block-levelisthe non-linear activation function φ(·) on our word frequen- highestresolution,andcounty-levelisthelowestresolution. cies(computedwithNormalizedWordorNormalizedUser). In the following subsections we provide details about our Our intuition is that the non-linearity of φ(·) would help experiments: baselines, data, preprocessing, training, and increase the capacity of the model. We separately used results. φ(x) = 1 (Logistic)andφ(x) = e−x2 (Gaussian). We 1+e−x setx = φ(v ),wherev = cw,i orv = uw,i. We 5.1 Baselines i,w i,w i,w Ci i,w Ui alsoexploredotheractivationfunctionssuchasTanH,Arc- We compare our approach with an approach adapted from Tan,andSoftsignbuttheydidnotshowpromisingresults. (Mohammady and Culotta 2014), where they used Tweets to predict race/ethnicity composition of counties. In their 4.2 Modeling approach, they used a bag-of-words representation normal- ized by Twitter users with tweeted words and words from We explore two variants of the problem: predicting demo- descriptionfieldsofTwitterusersasfeatures.Weadapttheir graphics of geographic units when population size is un- approachbyusingthesametypesoffeatures.Wecomputea knownandwhenpopulationsizeisknown. bag-of-wordsrepresentationusingUserNormalization,and PopulationSizeisUnknown Inthissettingwewouldlike thenwetrainamodelwiththisrepresentationandevaluateit topredictdemographiccounts(e.g. gender)y ofaregion onbothvariantsoftheproblem. Notethatinourcompeting i,j g foreachcategory(e.g. maleandfemale)j ∈ [1,k]with- configurations we do not use features from the description i outaccesstothepopulationsizeofg .Wechoosealinearre- fieldofTwitterusers. i gressionmodelforscalability. Foreachcategoryj ∈ [1,k], In the setting where population size is known, we also weoptimizethefollowingobjectivefunction: compare our models with a baseline that always uses gen- der and race/ethnicity proportions at the national level to n w =argmin 1 (cid:88)(w ·x −y )2+λ||w ||2 predict category counts of blocks, blockgroups, tracts, and j wj 2n j i i,j j 2 counties. The 2015 national level estimates of proportions i=1 are: Male(49.2%),Female(50.8%),White(61.6%), Black wherew istheweightvectorlearnedforcategoryj,and or African American (12.4%), Asian (5.4%), Hispanic or j λisanl2regularizationparametertopreventoverfitting.We Latino(17.6%),Other(3%)(U.S.CensusBureau2016). also explored l1 and ElasticNet regularizations, but they yieldedsimilarresults. 5.2 Data For a region gu with an unknown demographic category WecollectedalargedatasetofgeotaggedTweetsusingTwit- county ,wemapthebagofTweets{tj}Nu inregiong ter’sStreamingAPIfromJune12,2013toJanuary31,2014. u,j u j=1 u to a transformed feature vector x (we use the same con- WeonlyincludedTweetscomposedinthecontiguousU.S. u figuration that is used in optimizing the objective function, which consists of the 48 adjoining states and Washington e.g. RawWordwithTFIDF)andthenestimatey , where D.C. and does not include Alaska and Hawaii for exam- u,j yˆ =w ·x . ple. We used a bounding box of [125.0011,66.9326]W × u,j j u [24.9493,49.5904]N. PopulationSizeisKnown Inthissettingweassumethat BasedonaTweet’sGPScoordinates,weannotateitwith wehaveaccesstothetruepopulationcountp inregiong . i i the geographic identifier (GEOID) of the block that it ap- We fit a linear regression model to predict log yi,j which yi,q peared in. The U.S. Census Bureau provides geographic is log of the ratio of a demographic category count yi,j to boundaryfiles(shapefiles)foreachstate,whereeachshape- another demographic category count yi,q. We choose one file contains the boundary definitions for all the blocks in demographiccategoryasthedenominator(e.g. q = 1)and thatstate. ThisenablesustomapeachTweettoitsrespec- then learn wj for j ∈ [2,k] by optimizing the following tive block. We used 2010 shapefile definitions to match objectivefunction: with 2010 Census demographics data. Overall, we had 565,350,007Tweetsannotatedwithblock-levelGEOIDs. n w =argmin 1 (cid:88)(w ·x −log yi,j)2+λ||w ||2 j wj 2n j i yi,q j 2 Demographics Data We used data from the 2010 Cen- i=1 sus. The U.S. Census Bureau provides aggregate count data on different demographics such as gender, age, and Toestimateademographiccategorycounty forregion i,j race at multiple geographic levels (including block-level to g usingp ,wecompute: i i county-level). WespecificallyuseddatafromtheSummary  1 p j =q File 1 tables P12 and P5, for gender and race/ethnicity, re- yˆ = 1+(cid:80)km=2ewm·xi i , spectively. We used the Data Finder tool provided by the i,j  ewj·xi p j (cid:54)=q National Historical Geographic Information System (Min- 1+(cid:80)km=2ewm·xi i nesota Population Center 2011) to collect this data. We collected gender and race/ethnicity data at the block-level beforetraining.Weperformedagridsearchonthefollowing uptothecounty-level. Forgenderweusedtwocategories: hyper-parameters: MaleandFemale.Forrace/ethnicityweusedfivecategories: • λ∈{10−6,10−5,10−4,0.001,0.01,0.1} White, African American or Black, Asian, Hispanic, and • η ∈{10−6,10−5,10−4,0.001,0.01,0.1,1.0,10.0} Other. In each case, the categories are mutually exclusive. 0 Note that there is a time difference of two and a half years Thehyper-parametercombinationthatscoredthehighest between the demographics data and Twitter data. This can R2 (coefficientofdetermination)scoreonthevalidationset biasresultsandisadirectionforfuturework. was used for training on the entire training split (training and validation). We shuffled the training dataset after each 5.3 Preprocessing epoch of SGD training. We fixed the number of epochs to Twitterisfilledwithspamandorganizationalaccountsthat 10andρ=0.25.Earlyexperimentsshowedthattrainingfor post content we deem irrelevant to our application, as we more epochs (e.g. 100) does not improve the performance areinterestedincontentproducedbypersonalaccounts. To significantly,andlikewisechangingthevalueofρ. reduce the likelihood of including content from organiza- 5.5 Results tional/spam accounts, we removed Tweets from accounts with more than 1000 followers or 1000 friends (Lee, Eoff, We evaluate the baselines and our models (with different andCaverlee2011;McCorriston,Jurgens,andRuths2015) feature engineering configurations) on both variants of the and Tweets containing URLs (Guo and Chen 2014). We problem (population count is unknown vs. known). In also removed Retweets by checking for the existence of both variants we predict demographic category counts (y j retweeted status field or the RT token in Tweet text forj ∈[1,k])andevaluatetheperformanceonthetestsplit itself. Consequently, our dataset got narrowed down to using Pearson correlation r and the coefficient of determi- 423,622,202Tweetswith4,027,594uniqueTwitterusers. nation R2. Note that R2 compares the performance of a To build a bag-of-words representation, we split the text modelrelativetothebaselineofalwayspredictingtheaver- oftheTweetsintounigramtokens. Thereareseveralthings agevalueofthetestset. Table2summarizestheresultsof toconsiderwhentokenizingTweetssuchas: hashtags,user- our experiments (due to limited space we include only re- name mentions, emails, html entities, emoticons, etc. For sults of our best configurations). For Pearson correlations, thistask,weusedTwokenize(Ott2013),atokenizerde- allofthemarestatisticallysignificantusingatwo-tailedtest signedforTweetswhichtreatshashtags, emoticons, blocks withp-value<10−4. of punctuation marks, and other symbols as tokens. After PopulationSizeisUnknown Inthissetting,wefindthat tokenizing, weremovedusernamementions, emails, single ourmodelsoutperformthebaselineadaptedfrom(Moham- punctuation marks, and English stopwords. We converted mady and Culotta 2014), where correlation r (averaged alltokenstolowercase.Wealsochosetokeepemoticonsas acrossgenderandrace)improvesbyafactorof2.84(block), OConnor et al. (2010) showed in their analysis that groups 1.41(blockgroup),1.31(tract),and1.21(county)usingour with certain demographics (high percentage of Hispanics) UserNormalizationwithGaussianconfiguration. useemoticonsalot. Wefindthatourfeaturetransformations(e.g. Anscombe) improve upon the results of plain bag-of-words representa- 5.4 Training tionssignificantly.ComparedwithplainUserNormalization Foragivengeographicresolution,werandomlysplitthege- (whichisbetterthanplainWordNormalization),correlation ographicunitsthathaveTweetsinto90%trainingand10% r improves by a factor of 1.88 averaged across gender and testing (e.g. we train on 90% of blocks, and predict de- race and then across resolutions using User Normalization mographicsoftheremaining10%). 10%ofthegeographic with Gaussian. This tells us that such transformations help unitsinthetrainingsplitwerechosenrandomlyasavalida- predictdemographicsbetter. tion set. Note that this splitting is done separately for each geographic resolution. We have n - n examples: Population Size is Known In this setting, we find that train test 5,188,608-576,513(block);194,610-21,624(blockgroup); predictions improve overall and are better than predictions 65,239-7,249(tract);2,798-311(county). Thenwecom- intheothervariant(thisisexpectedbecauseweknowtotal puteaconfiguration(e.g.RawUserwithTFIDF)anduseall population size). For the task of predicting gender counts thewordsthatappearinthetrainingsplitasfeatures(more (inthiscaseqcorrespondstofemale),thebaselinesandour than22millionfeatures). Sincethisisalarge-scalelearning modelsperformcomparablytoeachother. Thisisthecase problem we utilize stochastic gradient descent (SGD) to fit becausethereislittlevariationingenderproportionsacross ourmodels. TouseSGD,wehavetochoosealearningrate geographies,soiftotalpopulationisknown,evenabaseline updatepolicy. Weusedinversescaling: thatpredictshalfofthatforbothcategorieswilldowell. Forthetaskofpredictingracecounts(inthiscaseq cor- η ητ = 0 respondstowhite), wefindthatourmodelsoutperformthe τρ baseline of predicting counts based on national-level pro- Where η is the initial learning rate, τ is the time step portions. Correlationrimprovesbyafactorof1.15(block), 0 (indexedbyepochandtrainingexample),andρisahyper- 2.05(blockgroup),2.42(tract),and1.12(county)usingour parameterthataffectsthedecreaserateofthelearningrate. RawUserwithTFIDFconfiguration.Comparedtothebase- Thereareseveralhyper-parametersthatneedtobeselected line adapted from (Mohammady and Culotta 2014), corre- PopulationSizeUnkown PopulationSizeKnown Res Demo Metric B WL UA UG C B WA WL UG RUT r 0.183 0.576 0.554 0.585 0.957 0.958 0.960 0.960 0.959 0.959 Gender R2 0.033 0.333 0.308 0.344 0.922 0.914 0.916 0.917 0.916 0.915 Block r 0.195 0.480 0.462 0.487 0.620 0.708 0.701 0.700 0.704 0.714 Race R2 0.039 0.235 0.218 0.243 0.377 0.216 0.320 0.376 0.391 0.449 r 0.389 0.671 0.670 0.667 0.980 0.982 0.982 0.982 0.979 0.982 Gender R2 0.150 0.449 0.448 0.444 0.959 0.963 0.964 0.964 0.958 0.964 BG r 0.569 0.692 0.690 0.683 0.407 0.780 0.783 0.788 0.783 0.835 Race R2 0.311 0.491 0.489 0.480 0.184 0.613 0.632 0.632 0.617 0.676 r 0.458 0.723 0.726 0.723 0.985 0.986 0.987 0.987 0.982 0.988 Gender R2 0.204 0.523 0.523 0.522 0.965 0.972 0.973 0.972 0.965 0.974 Tract r 0.669 0.756 0.757 0.752 0.365 0.796 0.833 0.838 0.825 0.884 Race R2 0.457 0.585 0.587 0.578 0.151 0.640 0.705 0.709 0.696 0.761 r 0.809 0.986 0.985 0.988 0.999 0.999 0.999 0.999 0.999 0.999 Gender R2 0.513 0.969 0.968 0.973 0.999 0.999 0.999 0.999 0.999 0.999 County r 0.745 0.879 0.868 0.898 0.850 0.963 0.834 0.945 0.962 0.956 Race R2 0.491 0.740 0.726 0.775 0.694 0.826 0.551 0.837 0.886 0.796 A-Anscombe,B-(MohammadyandCulotta2014),BG-Blockgroup,C-NationalProportions,Demo-Demographic G-Gaussian,L-Logistic,Res-Resolution,RU-RawUser,T-TFIDF,U-NormalizedUser,W-NormalizedWord Table2:Performanceresultsonmultipleresolutionsacrossgenderandrace/ethnicitypredictiontasks. Ineachproblemvariant, boldresultsinarowrepresentconfigurationsthatarestatisticallyindistinguishableusingapairedt-testwithp-value≥0.05. demographicsat,withreasonableaccuracy. Weplottheav- County Tract eragerelativeerroracrossgenderandraceversusnumberof 1.0 0.8 Twitterusers(thosewithgeotaggedTweets)inFigure1. We 0.8 0.6 0.6 findthat95%ofgeographicregionswithatleast100Twitter r 0.4 0.4 users,havelowrelativeerrors. Intheseregions,therelative rro 0.2 0.2 errorisatmost1.98(block),1.15(blockgroup),0.90(tract), E 0.0 0.0 e 102 103 104 105 102 103 104 105 and0.78(county). v ati Blockgroup Block Rel 1.2 2.0 6 Conclusion 1.0 1.5 0.8 0.6 1.0 Inthispaper,wehaveshownthatgeotaggedTweetscanbe 0.4 0.5 used to estimate demographics of high-resolution geogra- 0.2 0.0 0.0 phies. Our method can be used as an alternative or a com- 102 103 104 105 102 103 104 105 plement to survey methods. We have shown that certain Number of Twitter Users featuretransformationssuchasAnscombe,TFIDF,Logistic, andGaussiansignificantlyimprovepredictionperformance Figure1: Relativeerrorfor95%ofgeographicregionswith relativetocompetingbaselines.Wehavealsoshownthatthe atleast100Twitterusers(RawUserwithTFIDF). ourmethodisabletolearnproportionsofdemographiccat- egoriesandcanprovideaccuratepredictionsatregionswith atleast100Twitterusers. lation r improves by a factor of 1.01 (block), 1.07 (block Forfuturework,itisworthbringingattentiontotheeffect group),1.11(tract),and1.05(county). of data sampling rate on prediction. According to Eisen- We have shown that when we are interested in learning stein et al. (2014), word frequencies normalized by users proportionsofdemographics,ourapproachoutperformsex- (NormalizedUser)arenotinvarianttothesamplingrateof istingbaselines. Interestingly,wefindthatourbestconfigu- the data. If we remove half the tweets, then these frequen- rationisRawUserwithTFIDF(whichisnotthecasewith cieswilldecreasebecausethenumberofuserswilldecrease the other variant). This may in part be due to the fact that moreslowlythanrawwordcounts. So,itwouldbeinterest- UserNormalizationreducestheskewnessoffeaturevectors ingtoinvestigatemethodsthatcanminimizethevarianceof andthefactthattheTFIDFtransformationincreasestheim- suchnormalizationstothesamplingrate. portanceofless-frequentwordsanddampenstheimportance ofmorefrequentwords. 7 Acknowledgments Prediction Accuracy vs. Number of Twitter Users We ThisworkwassupportedbyNSFaward#1054389. Wealso explorethefinestgeographicresolutionthatwecanpredict thankGuangqingChiforprovidingtheTwitterdata. References [2016] Lampos, V.; Aletras, N.; Geyti, J. K.; Zou, B.; and Cox, I. J. 2016. Inferring the socioeconomic status of social media [2015] Almeida,J.M.,andPappa,G.L. 2015. Twitterpopulation usersbasedonbehaviourandlanguage. InEuropeanConference sample bias and its impact on predictive outcomes: a case study onInformationRetrieval,689–695. Springer. onelections. InProceedingsofthe2015IEEE/ACMInternational ConferenceonAdvancesinSocialNetworksAnalysisandMining [2014] Le,Q.V.,andMikolov,T.2014.Distributedrepresentations 2015,1254–1261. ACM. ofsentencesanddocuments. InICML,volume14,1188–1196. [1948] Anscombe,F.J.1948.Thetransformationofpoisson,bino- [2011] Lee,K.;Eoff,B.D.;andCaverlee,J. 2011. Sevenmonths mialandnegative-binomialdata. Biometrika35(3/4):246–254. withthedevils: Along-termstudyofcontentpollutersontwitter. [2014] Bamman,D.;Eisenstein,J.;andSchnoebelen,T.2014.Gen- InICWSM. deridentityandlexicalvariationinsocialmedia. JournalofSoci- [2015] Loff,J.; Reis,M.; andMartins,B. 2015. Predictingwell- olinguistics18(2):135–160. being with geo-referenced data collected from social media plat- [2016] Benton,A.;Paul,M.J.;Hancock,B.;andDredze,M. 2016. forms. In Proceedings of the 30th Annual ACM Symposium on Collectivesupervisionoftopicmodelsforpredictingsurveyswith AppliedComputing,1167–1173. ACM. social media. In Thirtieth AAAI Conference on Artificial Intelli- [2015] Malik,M.M.;Lamba,H.;Nakos,C.;andPfeffer,J. 2015. gence. Populationbiasingeotaggedtweets. InNinthInternationalAAAI [2013] Bergsma, S.; Dredze, M.; VanDurme, B.; Wilson, T.; and ConferenceonWebandSocialMedia. Yarowsky, D. 2013. Broadly improving user classification via [2015] McCorriston,J.;Jurgens,D.;andRuths,D. 2015. Organi- communication-basednameandlocationclusteringontwitter. In zationsareuserstoo:Characterizinganddetectingthepresenceof HLT-NAACL,1010–1019. organizationsontwitter. InNinthInternationalAAAIConference [2003] Blei,D.M.;Ng,A.Y.;andJordan,M.I.2003.Latentdirich- onWebandSocialMedia. letallocation. JournalofmachineLearningresearch3(Jan):993– [2013] Mikolov,T.,andDean,J. 2013. Distributedrepresentations 1022. ofwordsandphrasesandtheircompositionality. Advancesinneu- [2011] Burger, J. D.; Henderson, J.; Kim, G.; and Zarrella, G. ralinformationprocessingsystems. 2011. Discriminating gender on twitter. In Proceedings of the [2011] MinnesotaPopulationCenter,U.o.M. 2011. NationalHis- ConferenceonEmpiricalMethodsinNaturalLanguageProcess- toricalGeographicInformationSystem:Version2.0. ing,1301–1309. AssociationforComputationalLinguistics. [2014] Mohammady,E.,andCulotta,A. 2014. Usingcountyde- [2010] Chang,J.;Rosenn,I.;Backstrom,L.;andMarlow,C. 2010. mographicstoinferattributesoftwitterusers. ACL2014 7. epluribus:Ethnicityonsocialnetworks. ICWSM10:18–25. [2014] Moseley,N.; Alm,C.O.; andRege,M. 2014. Towardin- [2015] Chen, X.; Wang, Y.; Agichtein, E.; and Wang, F. 2015. ferringtheageoftwitteruserswiththeiruseofnonstandardabbre- Acomparativestudyofdemographicattributeinferenceintwitter. viationsandlexicon. InInformationReuseandIntegration(IRI), Proc.ICWSM. 2014IEEE15thInternationalConferenceon,219–226. IEEE. [2015] Culotta, A.; Kumar, N. R.; and Cutler, J. 2015. Predict- [2013] Ott, M. 2013. Twokenize. http://github.com/myleott/ark- ingthedemographicsoftwitterusersfromwebsitetrafficdata. In twokenize-py. AAAI,72–78. [2011] OConnor,B.;Bamman,D.;andSmith,N.A. 2011. Com- [2014a] Culotta,A. 2014a. Estimatingcountyhealthstatisticswith putationaltextanalysisforsocialscience: Modelassumptionsand twitter. In Proceedings of the 32nd annual ACM conference on complexity. publichealth41(42):43. Humanfactorsincomputingsystems,1335–1344. ACM. [2010] OConnor, B.; Eisenstein, J.; Xing, E.P.; andSmith, N.A. [2014b] Culotta,A. 2014b. Reducingsamplingbiasinsocialme- 2010. Amixturemodelofdemographiclexicalvariation. InPro- diadataforcountyhealthinference. InJointStatisticalMeetings ceedingsofNIPSworkshoponmachinelearningincomputational Proceedings. socialscience,1–7. [2015] Eichstaedt, J. C.; Schwartz, H. A.; Kern, M. L.; Park, G.; [2015] Preo¸tiuc-Pietro,D.;Volkova,S.;Lampos,V.;Bachrach,Y.; Labarthe,D.R.;Merchant,R.M.;Jha,S.;Agrawal,M.;Dziurzyn- and Aletras, N. 2015. Studying user income through language, ski,L.A.;Sap,M.;etal. 2015. Psychologicallanguageontwitter behaviourandaffectinsocialmedia. PloSone10(9):e0138717. predicts county-level heart disease mortality. Psychological sci- ence26(2):159–169. [2013a] Schwartz,H.A.;Eichstaedt,J.C.;Kern,M.L.;Dziurzyn- ski, L.; Ramones, S. M.; Agrawal, M.; Shah, A.; Kosinski, M.; [2014] Eisenstein,J.;O’Connor,B.;Smith,N.A.;andXing,E.P. Stillwell,D.; Seligman,M.E.; etal. 2013a. Personality,gender, 2014. Diffusion of lexical change in social media. PloS one andageinthelanguageofsocialmedia: Theopen-vocabularyap- 9(11):e113114. proach. PloSone8(9):e73791. [2011] Eisenstein,J.;Smith,N.A.;andXing,E.P. 2011. Discov- [2013b] Schwartz,H.A.;Eichstaedt,J.C.;Kern,M.L.;Dziurzyn- eringsociolinguisticassociationswithstructuredsparsity. InPro- ski, L.; Lucas, R. E.; Agrawal, M.; Park, G. J.; Lakshmikanth, ceedingsofthe49thAnnualMeetingoftheAssociationforCom- S.K.;Jha,S.;Seligman,M.E.;etal. 2013b. Characterizinggeo- putationalLinguistics: HumanLanguageTechnologies-Volume1, graphicvariationinwell-beingusingtweets. InICWSM. 1365–1374. AssociationforComputationalLinguistics. [2010] U.S. Census Bureau, D. P. 2010. Geographic Terms and [2014] Guo, D., andChen, C. 2014. Detectingnon-personaland Concepts-GeographicPresentationofData. spam users on geo-tagged twitter network. Transactions in GIS 18(3):370–384. [2016] U.S.CensusBureau, P.D. 2016. AnnualEstimatesofthe Resident Population by Sex, Race, and Hispanic Origin for the [2015] Ireland, M. E.; Schwartz, H. A.; Chen, Q.; Ungar, L. H.; UnitedStates,States,andCounties:April1,2010toJuly1,2015. and Albarrac´ın, D. 2015. Future-oriented tweets predict lower county-levelhivprevalenceintheunitedstates.HealthPsychology [2015] Vicente,M.; Batista,F.; andCarvalho,J.P. 2015. Twitter 34(S):1252. genderclassificationusinguserunstructuredinformation.InFuzzy Systems (FUZZ-IEEE), 2015 IEEE International Conference on, 1–7. IEEE. [2015] Volkova,S.; Bachrach,Y.; Armstrong,M.; andSharma,V. 2015.Inferringlatentuserpropertiesfromtextspublishedinsocial media. InAAAI,4296–4297.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.