When is it Biased? Assessing the Representativeness of Twitter's Streaming API

Fred Morstatter (Arizona State University, 699 S. Mill Ave, Tempe, AZ 85281)
Jürgen Pfeffer (Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213)
Huan Liu (Arizona State University, 699 S. Mill Ave, Tempe, AZ 85281)
[email protected]  [email protected]  [email protected]

arXiv:1401.7909v1 [cs.SI] 30 Jan 2014

ABSTRACT

Twitter has captured the interest of the scientific community not only for its massive user base and content, but also for its openness in sharing its data. Twitter shares a free 1% sample of its tweets through the "Streaming API", a service that returns a sample of tweets according to a set of parameters set by the researcher. Recently, research has pointed to evidence of bias in the data returned through the Streaming API, raising concern about the integrity of this data service for use in research scenarios. While these results are important, the methodologies proposed in previous work rely on the restrictive and expensive Firehose to find the bias in the Streaming API data. In this work we tackle the problem of finding sample bias without the need for "gold standard" Firehose data. Namely, we focus on finding time periods in the Streaming API data where the trend of a hashtag is significantly different from its trend in the true activity on Twitter. We propose a solution that uses an open data source to find bias in the Streaming API. Finally, we assess the utility of the data source in sparse data situations and for users issuing the same query from different regions.

Categories and Subject Descriptors

H.2.8 [Database Applications]: Data Mining

Keywords

Data Sampling, Sampling Bias, Twitter Analysis, Big Data

1. INTRODUCTION

Twitter is a microblogging site where users share short, 140-character messages called "tweets". Twitter has become one of the largest social networking sites in the world, with 240 million users who publish 500 million tweets each day (https://blog.twitter.com/2013/new-tweets-per-second-record-and-how) from their web browsers and mobile phones. Due to Twitter's massive size and ease of mobile publication, it has also become a central tool for communication during protests [3] and disasters [1]. This has caused an immense push from both the computer and social science research communities, who have used data from the site for applications ranging from predicting users' locations [5] to characterizing the life cycle of news stories [4].

Twitter's policy for data sharing is very open, providing a free "Streaming API" (https://dev.twitter.com/docs/streaming-apis) that returns tweets matching a query provided by the Streaming API user. One drawback of the Streaming API is that it returns at most 1% of the tweets on Twitter at a given moment. Once the volume of the query surpasses 1% of all of the tweets on Twitter, the response is sampled. The way in which Twitter samples the data is unpublished. Recent research [15] has shown evidence of bias in this sampling mechanism under certain conditions, leaving researchers to wonder when the Streaming API is representative and when it is biased.

One way to get around the 1% limit is to purchase the Twitter Firehose, a feed offered by Twitter that allows access to 100% of all of the public tweets posted on the site. Simply comparing the results from the Streaming API with the Firehose is one way to verify the results of the sampled service; this is the approach taken in [15]. Unfortunately, verifying results using the Firehose is not an option for most researchers, as access to Twitter's Firehose is restrictively expensive and limited to users with a business agreement with Twitter (https://dev.twitter.com/discussions/2752).

In this work, we define "bias" as sample bias. We say that a hashtag is "biased" if its relative trend is statistically significantly over-represented or under-represented in contrast to its true trend on Twitter. In particular, we are looking for particular time periods of bias in the Streaming API. Based on Twitter's documentation, the sample size is determined by the volume of a query at a point in time. There are times when the sample is representative, and times when it is biased. We try to find time periods where the data from the Streaming API is biased, meaning it is not an accurate representation of the true activity on Twitter.

The focus of this paper lies in finding and assessing an alternative method for bias detection in the Streaming API. Our goal is to detect the bias automatically, using methods that are openly available to researchers. We begin by discussing the related work, which includes other work that discovers bias in the Streaming API. We continue by introducing and vetting another data source, the Sample API (https://dev.twitter.com/docs/api/1/get/statuses/sample), and propose a methodology that utilizes this data source to show a user when there is bias in the results of their Streaming API query. Finally, we show that the Streaming API gives nearly the same results to identical queries originating from different points around the world, and to identical queries started at different points in time during the overlap of their execution. This work is the full version of [14].

2. RELATED WORK

Our related work falls into three areas: work previously done with Twitter's Streaming API, research done to verify other black box systems on the web, and previous evidence of bias found in Twitter's Streaming API.

2.1 Work with the Streaming API

Twitter's Streaming API has been used throughout the domain of social media and network analysis to generate understanding of how users behave on these platforms. It has been used to collect data for topic modeling [10, 16], network analysis [19], and statistical analysis of content [12], among others. Researchers' reliance upon this data source is significant, and these examples provide only a cursory glance at the tip of the iceberg.

2.2 Bias in Black Box Systems

The topic of assessing the results from a black box system is related to our work. Another web-science black box that has been studied is Amazon's Mechanical Turk (AMT). In [6], the authors assess the representativeness of the users on AMT, and provide example tasks showing areas where these users give good performance. In a similar vein, [18] proposes a method to correct for response bias in the results obtained from AMT. In the area of social media research, this topic has also been studied from the perspective of link propagation. In [7], the authors analyze the effect data sampling has on the way link propagation is perceived. Specifically, the authors study URLs shared on sampled Twitter data.

2.3 Bias in Twitter Data

There are many potential areas of bias in Twitter. One possible area comes from the demographic makeup of the users on the site. In [13], the authors use the last names of users to estimate their race, and find that the ethnicity/race distribution of Twitter users diverges widely from U.S. Census estimates. [8] finds similar results, and also finds that adults aged 18-29, African Americans, and urban residents are over-represented on the site.

The bias we focus on in this work is sample bias from Twitter's APIs. The work performed in [15] compared four commonly-studied facets of the Streaming API and Firehose data, looking for evidence of bias in each facet, and obtained widely different results across facets. First, the authors studied the statistical differences between the two datasets, using correlation to understand the differences between the top n hashtags in the two datasets. They find some bias in the occurrence of the top hashtags for low values of n.

The authors also compared topical facets of the text by extracting topics with LDA [2], where they found similar evidence of bias: the topics extracted through LDA are significantly different from those extracted from the gold-standard Firehose data. The next facet compared was the networks extracted from the data. Here, the authors extracted the User × User retweet network from both sources and compared centrality measures across the two networks. They find that, on average, the Streaming API is able to find the most central users in the Firehose 50% of the time. The final facet compared is the distribution of "geotagged" tweets. Here, they find no bias: the number of geotagged tweets from the Streaming API is over 90% of that in the Firehose.

3. DISCOVERING BIAS IN THE STREAMING API WITHOUT THE FIREHOSE

With work showing evidence that the Streaming API is biased, researchers must be able to tell whether their Streaming API sample is biased.
Vetting their dataset using the methodology proposed with the Firehose is prohibitive for many reasons. We propose a methodology that can give an indication of bias for a particular hashtag.

In this section, we investigate whether another open data source, Twitter's Sample API, can be used to find bias in the Streaming API. We show that using the Sample API, one can accurately detect bias in the Streaming API without the need for the prohibitive Firehose. We focus on alternative methods that help the user understand when their data diverges from the true activity on Twitter. We further show that not only can one find the bias using this method, but that these results are consistent regardless of when and where the Streaming API query was issued.

3.1 Validation of the Streaming API

We define the true trend of a particular hashtag, h, as a function, f(h). We define the trend of h as it is conveyed through the Streaming API as t(h). To make this estimation, we consider another freely-available open Twitter data source, the Sample API. Unlike the Streaming API, the Sample API takes no parameters and returns a 1% sample of all of the tweets produced on Twitter. Here, we empirically assess the Sample API's ability to return a truly random sample, and build a framework that compares the Streaming API data with the Firehose using the Sample API as a proxy. We call the trend from the Sample API s(h). Given the sparsity of the Sample API, it is likely impossible to recover the real trend f(h) from s(h); however, s(h) can be used as an indicator to understand when t(h) is biased.

3.2 Vetting the Randomness of the Sample API

Given the evidence of bias that was observed in the Streaming API, we must proceed with caution before using the Sample API as a surrogate gold standard. We begin our assessment of the randomness of the Sample API by collecting data from the feed. We collect all of the tweets available through the Sample API on 2013-08-30 from 17:00 - 21:00 UTC, and post-filter them by the keyword "syria". Simultaneously, we collect all of the tweets matching the keyword "syria" from the Gnip Twitter feed. The Gnip (http://gnip.com/) feed is another outlet for Firehose data; it also provides 100% of the publicly-available tweets on Twitter. We report the keywords and the collection volume for this dataset in Table 1.

Table 1: Data Collected to Test Bias Detection

    Data Source      Keywords   No. Tweets
    Firehose (Gnip)  syria         214,383
    Sample API       N/A           734,172

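As a concrete illustration of the post-filtering step, the following minimal Python sketch keeps only the collected Sample API tweets whose text matches a keyword. The file name and JSON layout (line-delimited objects with a `text` field) are assumptions for illustration, not details from the paper.

```python
import json

def post_filter(path, keyword):
    """Yield tweets whose text contains `keyword` (case-insensitive)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tweet = json.loads(line)
            if keyword.lower() in tweet.get("text", "").lower():
                yield tweet

# Keep only the Sample API tweets mentioning "syria".
syria_sample = list(post_filter("sample_api_2013-08-30.jsonl", "syria"))
```
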
To verify the validity of this source, we compare the ranked lists of top hashtags in both sets. We first plot the Kendall's τ_b score of the Sample API against the Firehose. Kendall's τ_b counts the concordant and discordant pairs between two ranked lists, giving us a sense of whether the frequency ranking of hashtags coming through the Sample API is the same as in the Firehose. We also plot the average and standard deviation of 100 perfectly random samples of the same size as the Sample API against the Firehose. The rank correlation in the top hashtags from the Firehose and the Sample API is shown in Figure 1.

[Figure 1: Rank correlation of Sample API and Gnip Firehose: the relationship between n, the number of top hashtags, and τ_b, the correlation coefficient, for the Sample API and for the randomly sampled Gnip data.]

Overall, we see that the shape of the Sample API's trend closely resembles that of the random samples, a promising sign that the Sample API is unbiased. However, the τ_b values occasionally fall outside of one standard deviation of the distribution of random samples. We therefore perform a statistical test to ensure that the Sample API data and Firehose data are not independent. We calculate the two-sided significance level of the Kendall τ statistic, which tests the following hypothesis:

H0 - The top k hashtags in the Firehose data and the top k hashtags in the Sample API data are independent, τ_b = 0.

The results of this experiment at varying levels of k are shown in Table 2. In all cases we are able to reject H0 at a 95% confidence level.

Table 2: Significance levels of the τ_b statistic for the top k hashtags, Sample API vs. Firehose. All lists of size greater than 40 had p-values below 10^-6.

    Top-k   τ_b        p-value
    10      0.988826   0.000069
    20      0.778663   0.000001
    30      0.655072   0.000001
    40      0.549759   0.000002
    50      0.604880   <10^-6
    ...     ...        ...
    450     0.476931   <10^-6

Given the strong correlation between the top hashtags and the strong similarity between the two distributions, we go forward with the understanding that the top hashtags from the Sample API are representative of the true activity on the Firehose.

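This check is straightforward to reproduce with standard tooling. A minimal sketch, assuming each source has been reduced to a flat list of hashtag occurrences; `scipy.stats.kendalltau` computes the τ_b variant (which handles ties) along with the two-sided p-value used to test H0. Anchoring the top-k list on the Firehose ranking is our assumption; the paper does not spell out which source defines it.

```python
from collections import Counter
from scipy.stats import kendalltau

def top_k_tau(firehose_tags, sample_tags, k):
    """Kendall's tau-b between the frequencies of the top-k Firehose
    hashtags, as counted in the Firehose and in the Sample API."""
    fh, sp = Counter(firehose_tags), Counter(sample_tags)
    top = [tag for tag, _ in fh.most_common(k)]
    x = [fh[tag] for tag in top]         # Firehose counts
    y = [sp.get(tag, 0) for tag in top]  # Sample API counts (0 if unseen)
    tau, p_value = kendalltau(x, y)      # tau-b, two-sided p-value
    return tau, p_value
```

Rejecting H0 whenever `p_value` falls below the chosen significance level reproduces the test behind Table 2.
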
With the revelation that the Sample API behaves like a random sample of the Firehose, one might be tempted to simply use it as a replacement for the Streaming API, post-filtering the Sample API's data to suit their needs. While this is correct from the perspective of data sampling, it leads to a massive loss of data. For example, assume that a query matches 10% of the data on the Firehose. Filtering the Sample API will give us a dataset that is 0.1% of the size of the Firehose, but unbiased. The Streaming API's sometimes-biased sampling method will give us a 1% sample of the Firehose that is 10x larger than the Sample API's. This may be counterintuitive, as larger samples have a greater chance of being unbiased; however, since the issue resides in the sampling methodology of the Streaming API, a sample of any size from this source has the potential to be biased. The size difference between the two sources cannot be ignored. Instead, we will use the Sample API to identify periods of bias in the Streaming API.

3.3 Finding Bias in the Trend of a Hashtag

Now that we know that the Sample API gives us an unbiased picture of Twitter's Firehose, we can construct a framework that incorporates this source to find bias. Herein, we propose a methodology that finds the bias in the Streaming API and reports to the user collecting the data when there is likely bias in the data.

With only one unbiased view from the Sample API, it is difficult to know what we should expect from our Streaming API data. When the results from both sources match, there is clearly no problem. When there is a difference, how do we know whether the relative error between the Sample API and the Streaming API at one time step is significant, or just a small deviation from a random sample? To better understand the Sample API's response, we bootstrap [9] the Sample API to obtain a confidence interval for the relative activity of the hashtag at a given time step.

We begin by normalizing both the Sample API and Streaming API time series. This is done by calculating the mean and standard deviation of the counts in the time series, and normalizing each point by its standard score:

    Standard Score(t_i) = (t_i - μ_T) / σ_T,    (1)

where μ_T and σ_T are the mean and standard deviation of all of the time periods in the time series, respectively, and t_i is an individual time series point. This ensures that the distribution of points from both time series is N(0, 1). We create 100 bootstrapped samples for each hashtag. We then extract the time series data from each sample and normalize them as before. This gives us a distribution of readings for each time period in the dataset. Next, we compare this distribution to the normalized time series from the Streaming API to detect the bias. We take the sample mean and sample standard deviation of this distribution at each point t_i as μb_i and σb_i. Borrowing the threshold used in control charts [17], we say that any Streaming API value at time t_i that is outside of ±3σb_i is biased.

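A compact sketch of this detection step, assuming the hashtag's Sample API tweets have already been reduced to per-hour bin indices and that the bootstrap resamples those tweets with replacement (one plausible reading of the procedure; the paper does not fix the resampling unit). All names are illustrative.

```python
import numpy as np

def standard_score(series):
    """Eq. (1): center a count series and scale by its standard deviation.
    Assumes the series is not constant (see Sec. 3.4 on sparsity)."""
    series = np.asarray(series, dtype=float)
    return (series - series.mean()) / series.std()

def biased_bins(streaming_counts, sample_bins, n_bins, n_boot=100, seed=0):
    """Indices of time bins where the normalized Streaming API trend falls
    outside +/- 3 bootstrap standard deviations of the Sample API trend.

    streaming_counts: per-bin counts of the hashtag in the Streaming API.
    sample_bins: the bin index (e.g., hour) of each matching Sample API tweet.
    """
    rng = np.random.default_rng(seed)
    sample_bins = np.asarray(sample_bins)
    boot = np.empty((n_boot, n_bins))
    for b in range(n_boot):
        # Resample the Sample API tweets with replacement and re-bin them.
        resampled = rng.choice(sample_bins, size=sample_bins.size, replace=True)
        boot[b] = standard_score(np.bincount(resampled, minlength=n_bins))
    mu_b, sigma_b = boot.mean(axis=0), boot.std(axis=0)  # per-bin mean and std
    t = standard_score(streaming_counts)
    return np.flatnonzero(np.abs(t - mu_b) > 3 * sigma_b)
```

Any bin index this returns corresponds to a point that escapes the ±3σ band shown in Figure 2(d).
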
We show a full example of our method in Figure 2, walking through the process for a single hashtag, "#believemovie", on August 5th, 2013. We choose this hashtag as it is one of the most frequent hashtags on that day. The process begins with the time series data for this hashtag from both the Streaming and Sample APIs, shown in Figures 2(a) and 2(b), respectively. Looking at the two figures, we immediately see a difference in the trends of this hashtag from the two sources. To obtain a confidence interval on the difference between the two sources, we create 100 bootstrapped samples; the time series extracted from these samples are shown in Figure 2(c). Finally, taking the mean and 3 standard deviations of the bootstrapped samples at each time point, we obtain the confidence intervals seen in Figure 2(d). We make several observations from Figure 2(d). First, a spike that occurs between hours 10 and 11 is underrepresented in the Streaming data. Also, the spikes that appear after hour 16 are both over-represented in the Streaming API. Due to the nature of our bootstrapping method, these observations are all statistically significant at the 99.7% confidence level.

[Figure 2: Steps of the process for finding bias in the Streaming API, plotted as normalized occurrence by hour on 2013-08-05: (a) Streaming API trend line for "#believemovie"; (b) Sample API trend line for the same hashtag; (c) trend lines from 100 bootstrapped samples of the Sample API; (d) bootstrapped average and ±3 standard deviations overlaid with the Streaming API.]

3.4 Signal Usability Under Sparsity

One potential drawback of our method lies in the sparsity of the Sample API. Accounting for only 1% of the Firehose, the "long tail" of hashtags will largely be ignored by the Sample API. This is problematic for researchers who wish to verify their Streaming API query's results when their query focuses on hashtags that do not see a lot of activity as a fraction of the entire activity on Twitter. One counterargument is that these kinds of queries will likely not eclipse the 1% threshold, and in general will be unbiased.

Our bootstrapping method is futile when there is data for the hashtag from the Streaming API but no data from the Sample API. In such cases, a bootstrapping approach will give us a degenerate distribution with mean 0, not allowing for meaningful comparison between the sources. We test the sparsity of the Sample API by finding "known zeros": hashtags seen in the results of the Streaming API query, but not in the Sample API, for a particular time unit.

Figure 3 shows the number of known zeros in the top 1,000 hashtags, with the hashtags ordered by frequency. The first hashtags are nearly perfect, with a total of only 4 known zeros in the top 10 hashtags. However, as we continue down the list, we begin to see more and more known zeros. While this method helps researchers find bias in their Streaming API queries, there are still many hours for many hashtags where no claim can be made about the validity of the data.

[Figure 3: Cumulative known zeros ranked by hashtag popularity. The most popular hashtags have relatively few points of missing data, while the less popular hashtags have many more. To help separate the higher-ranked hashtags' missing values, the x-axis is on a log scale.]

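Known zeros are cheap to tabulate once both sources are binned the same way; cumulative sums over the ranked output below reproduce a curve like Figure 3. A small sketch with illustrative names:

```python
from collections import Counter

def known_zeros_by_rank(streaming_bins, sample_bins):
    """Per-hashtag count of time bins where the Streaming API saw the tag
    but the Sample API saw nothing ('known zeros'), in popularity order.

    streaming_bins / sample_bins: dicts mapping hashtag -> per-bin counts.
    """
    popularity = Counter({tag: sum(c) for tag, c in streaming_bins.items()})
    zeros = {}
    for tag, _ in popularity.most_common():
        s = streaming_bins[tag]
        p = sample_bins.get(tag, [0] * len(s))
        zeros[tag] = sum(1 for a, b in zip(s, p) if a > 0 and b == 0)
    return zeros
```
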
i i+1 zeros”, hashtags seen in the results of the Streaming API query,butnotintheSampleAPIforaparticulartimeunit. 4.1 Between-CountryResults To compare the datasets, we calculate the Jaccard score of Figure 3 shows the number of known zeros in the top 1,000 the Tweet IDs from queryUSA and queryAT. We then take hashtags,witheachhashtagorderedbyfrequency. Here,we i i the median, average, and standard deviation of these Jac- see that the first hashtags are nearly perfect, with a total card scores. These results are reported in the first row of of only 4 known zeros in the top 10 hashtags. However, as Table 3. Here, we see a very high average and a very low we continue down the list, we begin to see more and more standard deviation between the Jaccard scores, indicating known-zeros. While this method helps researchers to find that the results from these two queries are nearly identical. bias in their Streaming API queries, there are still many These results indicate that no preference was given to the hoursformanyhashtagswherenoclaimcanbemadeabout queriesoriginatingintheUnitedStates. Wearehopefulthat the validity of the data. researchers outside of the US will obtain similar results. 4.2 Between-TimeResults 4. GEOGRAPHIC AND TEMPORAL STA- To compare the datasets, we fix the country C and calcu- BILITYOFQUERIES late the Jaccard score of the Tweet IDs from queryC and i In addition to tackling the bias problem, we also analyze queryC . The results for the USA and Austria queries i+1 the stability of the results when the data are collected in are shown in rows 2 and 3 of Table 3, respectively. In the different geographic areas, and for queries started at differ- case of the USA, we see even stronger results, with an ex- ent times. Do identical Streaming API queries started at tremely high average and an extremely low standard devi- different times get different responses during the overlap of ation. Here, we can see that the overlapping times receive their execution? Do identical Streaming API queries get practically the same dataset in all cases. In the case of the different responses if they are executed from different geo- Austriandatasets,weseethatthereisawiderdistributionof graphical regions? To ensure that the results obtained in Jaccardscoresbetweenquerywindows,howeverwecontinue this paper hold for researchers outside of the US, we assess toseeanextremelyhighmean,whichgivesusconfidencein whetherTwitterissuesthesameresultstoidenticalqueries. the coverage in these results. 5. DISCUSSIONANDFUTUREWORK 8(3):e57410, 2013. In this work we ask how to find bias in the Streaming API [7] M. De Choudhury, Y.-R. Lin, H. Sundaram, K. S. without the need for costly Firehose data. We test the rep- Candan, L. Xie, and A. Kelliher. How Does the Data resentativityofanotherfreely-availableTwitterdatasource, Sampling Strategy Impact the Discovery of the Sample API. We find that overall the tweets that come Information Diffusion in Social Media. In ICWSM, through the Sample API are a representative sample of the pages 34–41, 2010. true activity on Twitter. We propose a solution to har- [8] M. Duggan and J. Brenner. The Demographics of nessthisdatasourcetofindtimeperiodswheretheStream- Social Media Users, 2012. Pew Research Center’s ing API is likely biased. Finally, we show that results ob- Internet & American Life Project, 2013. tained through the Streaming API for a given time period [9] B. Efron. 
5. DISCUSSION AND FUTURE WORK

In this work we ask how to find bias in the Streaming API without the need for costly Firehose data. We test the representativeness of another freely-available Twitter data source, the Sample API. We find that, overall, the tweets that come through the Sample API are a representative sample of the true activity on Twitter. We propose a solution that harnesses this data source to find time periods where the Streaming API is likely biased. Finally, we show that results obtained through the Streaming API for a given time period see a significant amount of overlap between queries generated from both the United States and from Austria, and between queries started at different points in time.

One question that arises is how to integrate the results of our framework into an individual's research. One solution is to focus research efforts on time periods where no (or little) bias is found in the dataset. Another potential solution is to purchase Twitter data from a reseller such as Topsy (http://www.topsy.com) or Gnip, paying only for the biased time periods. A third solution is to incorporate other forms of social media, such as blogs. This allows for a multifaceted view of the data, and can give the researcher more depth in times where their Twitter data may be in question. One way to incorporate these other views is to cross-reference users from Twitter with other social media outlets; one such solution has been proposed in [20].

We have studied the feasibility of our method with the sparse signal that we get from the Sample API. Overall, we find that this method can be used when the query issued by the researcher receives a lot of attention from Twitter users, and that it is less useful when the query receives less data. One potential way to alleviate this problem is to use other bootstrapping methods, such as that proposed in [11], which takes neighboring values into account. Future work will attempt to find bias in sparse data scenarios and to adapt these methods to the speed and ephemerality of Twitter data.

6. REFERENCES

[1] Humanitarianism in the Network Age. United Nations Office for the Coordination of Humanitarian Affairs, 2012.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. The Journal of Machine Learning Research, 3:993-1022, 2003.
[3] D. G. Campbell. Egypt Unshackled: Using Social Media to @#:) the System. Cambria Books, Amherst, NY, 2011.
[4] C. Castillo, M. El-Haddad, J. Pfeffer, and M. Stempeck. Characterizing the Life Cycle of Online News Stories Using Social Media Reactions. arXiv preprint arXiv:1304.3010, 2013.
[5] Z. Cheng, J. Caverlee, and K. Lee. You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users. In CIKM, pages 759-768, 2010.
[6] M. J. Crump, J. V. McDonnell, and T. M. Gureckis. Evaluating Amazon's Mechanical Turk as a Tool for Experimental Behavioral Research. PLoS ONE, 8(3):e57410, 2013.
[7] M. De Choudhury, Y.-R. Lin, H. Sundaram, K. S. Candan, L. Xie, and A. Kelliher. How Does the Data Sampling Strategy Impact the Discovery of Information Diffusion in Social Media. In ICWSM, pages 34-41, 2010.
[8] M. Duggan and J. Brenner. The Demographics of Social Media Users, 2012. Pew Research Center's Internet & American Life Project, 2013.
[9] B. Efron. The Jackknife, the Bootstrap and Other Resampling Plans. Society for Industrial and Applied Mathematics, 1987.
[10] L. Hong and B. D. Davison. Empirical Study of Topic Modeling in Twitter. In Proc. First Workshop on Social Media Analytics, SOMA '10, pages 80-88, NYC, NY, USA, 2010. ACM.
[11] P. D. Kirk and M. P. Stumpf. Gaussian Process Regression Bootstrapping: Exploring the Effects of Uncertainty in Time Course Data. Bioinformatics, 25(10):1300-1306, 2009.
[12] M. Mathioudakis and N. Koudas. TwitterMonitor: Trend Detection over the Twitter Stream. In SIGMOD, pages 1155-1158, 2010.
[13] A. Mislove, S. Lehmann, Y.-Y. Ahn, J.-P. Onnela, and J. N. Rosenquist. Understanding the Demographics of Twitter Users. In ICWSM, 2011.
[14] F. Morstatter, J. Pfeffer, and H. Liu. When is it Biased? Assessing the Representativeness of Twitter's Streaming API. In WWW Web Science, 2014.
[15] F. Morstatter, J. Pfeffer, H. Liu, and K. Carley. Is the Sample Good Enough? Comparing Data from Twitter's Streaming API with Twitter's Firehose. In ICWSM, 2013.
[16] A. Pozdnoukhov and C. Kaiser. Space-Time Dynamics of Topics in Streaming Text. In LBSN, 2011.
[17] T. P. Ryan. Statistical Methods for Quality Improvement, volume 840. Wiley, 2011.
[18] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In EMNLP, pages 254-263, 2008.
[19] M. Sofean and M. Smith. A Real-Time Architecture for Detection of Diseases Using Social Networks: Design, Implementation and Evaluation. In Hypertext, pages 309-310, 2012.
[20] R. Zafarani and H. Liu. Connecting Users Across Social Media Sites: A Behavioral-Modeling Approach. In KDD, pages 41-49, 2013.
