Semantic Annotation for Microblog Topics Using Wikipedia Temporal Information

Tuan Tran, Nam Khanh Tran, Asmelash Teka Hadgu, Robert Jäschke
L3S Research Center, Hannover, Germany
[email protected], [email protected], [email protected], [email protected]

arXiv:1701.03939v1 [cs.IR] 14 Jan 2017

Abstract

Trending topics in microblogs such as Twitter are valuable resources to understand social aspects of real-world events. To enable deep analyses of such trends, semantic annotation is an effective approach; yet the problem of annotating microblog trending topics is largely unexplored by the research community. In this work, we tackle the problem of mapping trending Twitter topics to entities from Wikipedia. We propose a novel model that complements traditional text-based approaches by rewarding entities that exhibit a high temporal correlation with topics during their burst time period. By exploiting temporal information from the Wikipedia edit history and page view logs, we have improved the annotation performance by 17-28%, as compared to the competitive baselines.

1 Introduction

With the proliferation of microblogging and its wide influence on how information is shared and digested, the study of microblog sites has gained interest in recent NLP research. Several approaches have been proposed to enable a deep understanding of information on Twitter. An emerging approach is to use semantic annotation techniques, for instance by mapping Twitter information snippets to canonical entities in a knowledge base or to Wikipedia (Meij et al., 2012; Guo et al., 2013), or by revisiting NLP tasks in the Twitter domain (Owoputi et al., 2013; Ritter et al., 2011). Much of the existing work focuses on annotating a single Twitter message (tweet). However, information in Twitter is rarely digested in isolation, but rather in a collective manner, with the adoption of special mechanisms such as hashtags. When put together, the unprecedentedly massive adoption of a hashtag within a short time period can lead to bursts and often reflects trending social attention. Understanding the meaning of trending hashtags offers a valuable opportunity for various applications and studies, such as viral marketing, social behavior analysis, recommendation, etc. Unfortunately, the task of hashtag annotation has been largely unexplored so far.

In this paper, we study the problem of annotating trending hashtags on Twitter by entities derived from Wikipedia. Instead of establishing a static semantic connection between hashtags and entities, we are interested in dynamically linking the hashtags to entities that are closest to the underlying topics during burst time periods of the hashtags. For instance, while '#sochi' refers to a city in Russia, during February 2014, the hashtag was used to report the 2014 Winter Olympics (cf. Figure 1). Hence, it should be linked more to Wikipedia pages related to the event than to the location.

[Figure 1 shows three example tweets: "Hard to believe anyone can do worse than Russia in #Sochi. Brazil seems to be trying pretty hard though! sportingnews.com…", "#sochi Sochi 2014: Record number of positive tests - SkySports: q.gs/6nbAA", and "#Sochi Sea Port. What a beautiful site! #Russia", linked to the entities 2014_Winter_Olympics and Port_of_Sochi.]
Figure 1: Example of trending hashtag annotation. During the 2014 Winter Olympics, the hashtag '#sochi' had a different meaning.

Compared to traditional domains of text (e.g., news articles), annotating hashtags poses additional challenges. Hashtags' surface forms are very ad-hoc, as they are chosen not in favor of the text quality, but by the dynamics in attention of the large crowd. In addition, the evolution of the semantics of hashtags (e.g., in the case of '#sochi') makes them more ambiguous. Furthermore, a hashtag can encode multiple topics at once. For example, in March 2014, '#oscar' refers to the 86th Academy Awards, but at the same time also to the Trial of Oscar Pistorius. Sometimes, it is difficult even for humans to understand a trending hashtag without knowledge about what was happening with the related entities in the real world.

In this work, we propose a novel solution to these challenges by leveraging temporal knowledge about entity dynamics derived from Wikipedia. We hypothesize that a trending hashtag is associated with an increase in public attention to certain entities, and this can also be observed on Wikipedia. As in Figure 1, we can identify 2014 Winter Olympics as a prominent entity for '#sochi' during February 2014, by observing the change of user attention to the entity, for instance via the page view statistics of Wikipedia articles. We exploit both Wikipedia edits and page views for annotation. We also propose a novel learning method, inspired by the information spreading nature of social media such as Twitter, to suggest the optimal annotations without the need for human labeling. In summary:

• We are the first to combine the Wikipedia edit history and page view statistics to overcome the temporal ambiguity of Twitter hashtags.

• We propose a novel and efficient learning algorithm based on influence maximization to automatically annotate hashtags. The idea is generalizable to other social media sites that have a similar information spreading nature.

• We conduct thorough experiments on a real-world dataset and show that our system can outperform competitive baselines by 17-28%.

2 Related Work

Entity Linking in Microblogs. The task of semantic annotation in microblogs has been recently tackled by different methods, which can be divided into two classes, i.e., content-based and graph-based methods. While the content-based methods (Meij et al., 2012; Guo et al., 2013; Fang and Chang, 2014) consider tweets independently, the graph-based methods (Cassidy et al., 2012; Liu et al., 2013) use all related tweets (e.g., posted by a user) together. However, most of them focus on entity mentions in tweets. In contrast, we take into account hashtags which reflect the topics discussed in tweets, and leverage external resources from Wikipedia (in particular, the edit history and page view logs) for semantic annotation.

Analysis of Twitter Hashtags. In an attempt to understand the user interest dynamics on Twitter, a rich body of work analyzes the temporal patterns of popular hashtags (Lehmann et al., 2012; Naaman et al., 2011; Tsur and Rappoport, 2012). Few works have paid attention to the semantics of hashtags, i.e., to the underlying topics conveyed in the corresponding tweets. Recently, Bansal et al. (2015) attempt to segment a hashtag and link each of its tokens to a Wikipedia page. However, the authors only aim to retrieve entities directly mentioned within a hashtag, which are very few in practice. The external information derived from the tweets is largely ignored. In contrast, we exploit both context information from the microblog and Wikipedia resources.

Event Mining Using Wikipedia. Recently some works exploit Wikipedia for detecting and analyzing events on Twitter (Osborne et al., 2012; Tolomei et al., 2013; Tran et al., 2014). However, most of the existing studies focus on the statistical signals of Wikipedia (such as the edit or page view volumes). We are the first to combine the content of the Wikipedia edit history and the magnitude of page views to handle trending topics on Twitter.

3 Framework

Preliminaries. We refer to an entity (denoted by e) as any object described by a Wikipedia article (ignoring disambiguation, lists, and redirect pages). The number of times an entity's article has been requested is called the entity view count. The text content of the article is denoted by C(e). In this work, we choose to study hashtags at the daily level, i.e., from the timestamps of tweets we only consider their creation day. A hashtag is called trending at a time point (a day) if the number of tweets where it appears is significantly higher than that on other days. There are many ways to detect such trends (Lappas et al., 2009; Lehmann et al., 2012). Each trending hashtag has one or multiple burst time periods, surrounding the trending day, where the users' interest in the underlying topic remains stronger than in other periods. We denote with T(h) (or T for short) one hashtag burst time period, and with D_T(h) the set of tweets containing the hashtag h created during T.

Task Definition. Given a trending hashtag h and the burst time period T of h, identify the top-k most prominent entities to describe h during T.

It is worth noting that not all trending hashtags are mappable to Wikipedia entities, as the coverage of topics in Wikipedia is much lower than on Twitter. This is also the limitation of systems relying on Wikipedia such as entity disambiguation, which can only disambiguate popular entities and not the ones in the long tail. In this study, we focus on the precision and the popular trending hashtags, and leave the improvement of recall to future work.

Overview. We approach the task in three steps. The first step is to identify all entity candidates by checking surface forms of the constituent tweets of the hashtag. In the second step, we compute different similarities between each candidate and the hashtag, based on different types of contexts, which are derived from either side (Wikipedia or Twitter). Finally, we learn a unified ranking function for each (hashtag, entity) pair and choose the top-k entities with the highest scores. The ranking function is learned through an unsupervised model and needs no human-defined labels.

3.1 Entity Linking

The most obvious resource to identify candidate entities for a hashtag is via its tweets. We follow common approaches that use a lexicon to match each textual phrase in a tweet to a potential entity set (Shen et al., 2013; Fang and Chang, 2014). Our lexicon is constructed from Wikipedia page titles, hyperlink anchors, redirects, and disambiguation pages, which are mapped to the corresponding entities. As for the tweet phrases, we extract all n-grams (n ≤ 5) from the input tweets within T. We apply the longest-match heuristic (Meij et al., 2012): We start with the longest n-grams and stop as soon as the entity set is found, otherwise we continue with the smaller constituent n-grams.

Candidate Set Expansion. While the lexicon-based linking works well for single tweets, applying it on the hashtag level has subtle implications. Processing a huge amount of text, especially during a hashtag burst time period, incurs expensive computational costs. Therefore, to guarantee a good recall in this step while still maintaining feasible computation, we apply entity linking only on a random sample of the complete tweet set. Then, for each candidate entity e, we include all entities whose Wikipedia article is linked with the article of e by an outgoing or incoming link.

3.2 Measuring Entity–Hashtag Similarities

To rank the entities by prominence, we measure the similarity between each candidate entity and the hashtag. We study three types of similarities:

Mention Similarity. This measure relies on the explicit mentions of entities in tweets. It assumes that entities directly linked from more prominent anchors are more relevant to the hashtag. It is estimated using both statistics from Wikipedia and tweet phrases, and turns out to be surprisingly effective in practice (Fang and Chang, 2014).

Context Similarity. For entities that are not directly linked to mentions (the mention similarity is zero) we exploit external resources instead. Their prominence is perceived by users via external sources, such as web pages linked from tweets, or entity home pages or Wikipedia pages. By exploiting the content of entities from these external sources, we can complement the explicit similarity metrics based on mentions.

Temporal Similarity. The two measures above rely on the textual representation and are degraded by the linguistic difference between the two platforms. To overcome this drawback, we incorporate the temporal dynamics of hashtags and entities, which serve as a proxy to the change of user interests towards the underlying topics (Ciglan and Nørvåg, 2010). We employ the correlation between the time series of hashtag adoption and the entity view as the third similarity measure.

3.3 Ranking Entity Prominence

While each similarity measure captures one piece of evidence of the entity prominence, we need to unify all scores to obtain a global ranking function. In this work, we propose to combine the individual similarities using a linear function:

$$f(e,h) = \alpha f_m(e,h) + \beta f_c(e,h) + \gamma f_t(e,h) \tag{1}$$

where $\alpha, \beta, \gamma$ are model weights and $f_m, f_c, f_t$ are the similarity measures based on mentions, context, and temporal information, respectively, between the entity e and the hashtag h. We further constrain that $\alpha + \beta + \gamma = 1$, so that the ranking scores of entities are normalized between 0 and 1, and that our learning algorithm is more tractable. The algorithm, which automatically learns the parameters without the need of human-labeled data, is explained in detail in Section 5.

4 Similarity Measures

We now discuss in detail how the similarity measures between hashtags and entities are computed.

4.1 Link-based Mention Similarity

The similarity of an entity with one individual mention in a tweet can be interpreted as the probabilistic prior in mapping the mention to the entity via the lexicon. One common way to estimate the entity prior exploits the anchor statistics from Wikipedia links, and has been proven to work well in different domains of text. We follow this approach and define $LP(e|m) = \frac{|l_m(e)|}{\sum_{m'} |l_{m'}(e)|}$ as the link prior of the entity e given a mention m, where $l_m(e)$ is the set of links with anchor m that point to e. The mention similarity $f_m$ is measured as the aggregation of link priors of the entity e over all mentions in all tweets with the hashtag h:

$$f_m(e,h) = \sum_{m} \left( LP(e|m) \cdot q(m) \right) \tag{2}$$

where q(m) is the frequency of the mention m over all mentions of e in all tweets of h.

4.1.1 Context Similarity

To compute $f_c$, we first construct the contexts for hashtags and entities. The context of a hashtag is built by extracting all words from its tweets. We tokenize and parse the tweets' part-of-speech tags (Owoputi et al., 2013), and remove words of Twitter-specific tags (e.g., @-mentions, URLs, emoticons, etc.). Hashtags are normalized using the word breaking method by Wang et al. (2011).

The textual context of an entity is extracted from its Wikipedia article. One subtle aspect is that the articles are not created at once, but are incrementally updated over time in accordance with changing information about entities. Texts added in the same time period of a trending hashtag contribute more to the context similarity between the entity and the hashtag. Based on this observation, we use the Wikipedia revision history (an archive of all revisions of Wikipedia articles) to calculate the entity context. We collect the revisions of articles during the time period T, plus one day to acknowledge possible time lags. We compute the difference between two consecutive revisions, and extract only the added text snippets. These snippets are accumulated to form the temporal context of an entity e during T, denoted by $C_T(e)$. The distribution of a word w for the entity e is estimated by a mixture between the probability of generating w from the temporal context and from the general context C(e) of the entity:

$$\hat{P}(w|e) = \lambda \hat{P}(w|M_{C_T(e)}) + (1-\lambda) \hat{P}(w|M_{C(e)})$$

where $M_{C_T(e)}$ and $M_{C(e)}$ are the language models of e based on $C_T(e)$ and $C(e)$, respectively. The probability $\hat{P}(w|M_{C(e)})$ can be regarded as corresponding to the background model, while $\hat{P}(w|M_{C_T(e)})$ corresponds to the foreground model in traditional language modeling settings. Here we use a simple maximum likelihood estimation to estimate these probabilities: $\hat{P}(w|M_{C(e)}) = \frac{tf_{w,c}}{|C(e)|}$ and $\hat{P}(w|M_{C_T(e)}) = \frac{tf_{w,c_T}}{|C_T(e)|}$, where $tf_{w,c}$ and $tf_{w,c_T}$ are the term frequencies of w in the two text sources of $C(e)$ and $C_T(e)$, respectively, and $|C(e)|$ and $|C_T(e)|$ are the lengths of the two texts, respectively. We use the same estimation for tweets: $\hat{P}(w|h) = \frac{tf_{w,D(h)}}{|D(h)|}$, where D(h) is the concatenated text of all tweets of h in T. We use and normalize the Kullback-Leibler divergence to compare the distributions over all words appearing both in the Wikipedia contexts and the tweets:

$$KL(e \,\|\, h) = \sum_{w} \hat{P}(w|e) \cdot \log \frac{\hat{P}(w|e)}{\hat{P}(w|h)}$$

$$f_c(e,h) = e^{-KL(e \| h)} \tag{3}$$

4.1.2 Temporal Similarity

The third similarity, $f_t$, is computed using temporal signals from both sources, Twitter and Wikipedia. For the hashtags, we build the time series based on the volume of tweets adopting the hashtag h on each day in T: $TS_h = [n_1, n_2, \ldots, n_{|T|}]$. Similarly for the entities, we build the time series of view counts for the entity e in T: $TS_e = [v_1, v_2, \ldots, v_{|T|}]$. A time series similarity metric is then used to compute $f_t$. Several metrics can be used; however, most of them suffer from the time lag and scaling discrepancy, or incur expensive computational costs (Radinsky et al., 2011). In this work, we employ a simple yet effective metric that is agnostic to the scaling and time lag of time series (Yang and Leskovec, 2011). It measures the distance between two time series by finding optimal shifting and scaling parameters to match the shape of two time series:

$$f_t(e,h) = \min_{q,\delta} \frac{\| TS_h - \delta\, d_q(TS_e) \|}{\| TS_h \|} \tag{4}$$

where $d_q(TS_e)$ is the time series derived from $TS_e$ by shifting q time units, and $\|\cdot\|$ is the $L_2$ norm. It has been proven that Equation 4 has a closed-form solution for $\delta$ given fixed q, thus we can design an efficient gradient-based optimization algorithm to compute $f_t$ (Yang and Leskovec, 2011).

[Figure 2 shows, on the left, tweets such as "#Sochi 2014: Russia's ice hockey dream ends as Vladimir Putin watches on …", "#sochi Sochi: Team USA takes 3 more medals, tops leaderboard", "#Sochi bear after #Russia's hockey team eliminated with loss to #Finland", and "I'm still happy because Finland won. Is that too stupid..? #Hockey #Sochi …"; on the right, the linked entities Vladimir_Putin, Sochi, Russia, 2014_Winter_Olympics, Russia_men's_national_ice_hockey_team, Finland, Ice_hockey_at_the_2014_Winter_Olympics, United_States, and Ice_hockey.]
Figure 2: Excerpt of tweets about ice hockey results in the 2014 Winter Olympics (left), and the observed linking process between time-aligned revisions of candidate Wikipedia entities (right). Links come more from prominent entities to marginal ones to provide background, or more context for the topics. Thus, starting from prominent entities, we can reach more entities in the graph of candidate entities.

5 Entity Prominence Ranking

5.1 Ranking Framework

To unify the individual similarities into one global metric (Equation 1), we need a guiding premise of what manifests the prominence of an entity to a hashtag. Such a premise can be instructed through manual assessment (Meij et al., 2012; Guo et al., 2013), but it requires human-labeled data and is biased from evaluator to evaluator. Other heuristics assume that entities close to the main topic of a text are also coherent to each other (Ratinov et al., 2011; Liu et al., 2013). Based on this, state-of-the-art methods in traditional disambiguation estimate entity prominence by optimizing the overall coherence of the entities' semantic relatedness. However, this coherence does not hold for topics in hashtags: Entities reported in a big topic such as the Olympics vary greatly with different sub-events. They are not always coherent to each other, as they are largely dependent on the users' diverse attention to each sub-event. This heterogeneity of hashtags calls for a different premise, abandoning the idea of coherence.

Influence Maximization (IM). We propose a new approach to find entities for a hashtag. We use an observed behavioral pattern in creating Wikipedia pages for guiding our approach to entity prominence: Wikipedia articles of entities that are prominent for a topic are quickly created or updated [1], and subsequently enriched with links to related entities. This linking process signals the dynamics of editor attention and exposure to the event (Keegan et al., 2011). We argue that the process does not, or to a much lesser degree, happen to more marginal entities or to very general entities. As illustrated in Figure 2, the entities closer to the 2014 Olympics get more updates in the revisions of their Wikipedia articles, with subsequent links pointing to articles of more distant entities. The direction of the links influences the shifting attention of users (Keegan et al., 2011) as they follow the structure of articles in Wikipedia.

[1] Osborne et al. (2012) suggested a time lag of 3 hours.

We assume that, similar to Wikipedia, the entity prominence also influences how users are exposed to and spread the hashtag on Twitter. In particular, the initial spreading of a trending hashtag involves more entities in the focus of the topic. Subsequent exposure and spreading of the hashtag then include other related entities (e.g., discussing background or providing context), driven by interests in different parts of the topic. Based on this assumption, we propose to gauge the entity prominence as its potential in maximizing the information spreading within all entities present in the tweets of the hashtag. In other words, the problem of ranking the most prominent entities becomes identifying the set of entities that lead to the largest number of entities in the candidate set. This problem is known in social network research as influence maximization (Kempe et al., 2003).

Iterative Influence-Prominence Learning (IPL). IM itself is an NP-hard problem (Kempe et al., 2003). Therefore, we propose an approximation framework, which can jointly learn the influence scores of the entity and the entity prominence together. The framework (called IPL) contains several iterations, each consisting of two steps: (1) Pick a model and use it to compute the entity influence score. (2) Based on the influence scores, update the entity prominence. In the sequel we detail our learning framework.

5.2 Entity Graph

Influence Graph. To compute the entity influence scores, we first construct the entity influence graph as follows. For each hashtag h, we construct a directed graph $G_h = (E_h, V_h)$, where the nodes $E_h \subseteq E$ consist of all candidate entities (cf. Section 3.1), and an edge $(e_i, e_j) \in V_h$ indicates that there is a link from $e_j$'s Wikipedia article to $e_i$'s. Note that edges of the influence graph are reversed in direction relative to links in Wikipedia, as such a link gives an "influence endorsement" from the destination entity to the source entity.

Entity Relatedness. In this work, we assume that an entity endorses more of its influence score to highly related entities than to less related ones. We use a popular entity relatedness measure suggested by Milne and Witten (2008):

$$MW(e_1,e_2) = 1 - \frac{\log(\max(|I_1|,|I_2|)) - \log(|I_1 \cap I_2|)}{\log(|E|) - \log(\min(|I_1|,|I_2|))}$$

where $I_1$ and $I_2$ are sets of entities having links to $e_1$ and $e_2$, respectively, and E is the set of all entities in Wikipedia. The influence transition from $e_i$ to $e_j$ is defined as the normalized value:

$$b_{i,j} = \frac{MW(e_i,e_j)}{\sum_{(e_i,e_k) \in V} MW(e_i,e_k)} \tag{5}$$

Influence Score. Let $r_h$ be the influence score vector of entities in $G_h$. We can estimate $r_h$ efficiently using random walk models, similarly to the baseline method suggested by Liu et al. (2014):

$$r_h := \tau B r_h + (1-\tau) s_h \tag{6}$$

where B is the influence transition matrix, $s_h$ are the initial influence scores that are based on the entity prominence model (Step 1 of IPL), and $\tau$ is the damping factor.

Algorithm 1: Entity Influence-Prominence Learning
  Input: h, T, D_T(h), B, k, learning rate µ, threshold ε
  Output: ω, top-k most prominent entities
  Initialize: ω := ω(0)
  Calculate f_m, f_c, f_t, f_ω := f_ω(0) using Eqs. 1, 2, 3, 4
  while true do
    f̂_ω := normalize f_ω
    Set s_h := f̂_ω, calculate r_h using Eq. 6
    Sort r_h, get the top-k entities E(h,k)
    if Σ_{e ∈ E(h,k)} L(f(e,h), r(e,h)) < ε then
      Stop
    end
    ω := ω − µ Σ_{e ∈ E(h,k)} ∇L(f(e,h), r(e,h))
  end
  return ω, E(h,k)

5.3 Learning Algorithm

Now we detail the IPL algorithm. The objective is to learn the model $\omega = (\alpha, \beta, \gamma)$ of the global function (Equation 1). The general idea is that we find an optimal $\omega$ such that the average error with respect to the top influencing entities is minimized:

$$\omega = \arg\min \sum_{e \in E(h,k)} L(f(e,h), r(e,h))$$

where r(e,h) is the influence score of e and h, E(h,k) is the set of top-k entities with highest r(e,h), and L is the squared error loss function, $L(x,y) = \frac{(x-y)^2}{2}$.

The main steps are depicted in Algorithm 1. We start with an initial guess for $\omega$, and compute the similarities for the candidate entities. Here $f_m$, $f_c$, $f_t$, and $f_\omega$ represent the similarity score vectors. We use matrix multiplication to calculate the similarities efficiently. In each iteration, we first normalize $f_\omega$ such that the entity scores sum up to 1. A random walk is performed to calculate the influence score $r_h$. Then we update $\omega$ using a batch gradient descent method on the top-k influencer entities. To derive the gradient of the loss function L, we first remark that our random walk Equation 6 is similar to context-sensitive PageRank (Haveliwala, 2002). Using the linearity property (Fogaras et al., 2005), we can express r(e,h) as the linear function of influence scores obtained by initializing with the individual similarities $f_m$, $f_c$, and $f_t$ instead of $f_\omega$. The derivative thus can be written as:

$$\nabla L(f(e,h), r(e,h)) = \alpha (r_m(e,h) - f_m(e,h)) + \beta (r_c(e,h) - f_c(e,h)) + \gamma (r_t(e,h) - f_t(e,h))$$

where $r_m(e,h)$, $r_c(e,h)$, $r_t(e,h)$ are the components of the three vector solutions of Equation 6, each having $s_h$ replaced by $f_m$, $f_c$, $f_t$ respectively.

Since both B and $\hat{f}_\omega$ are normalized such that their column sums are equal to 1, Equation 6 is convergent (Haveliwala, 2002). Also, as discussed above, $r_h$ is a linear combination of factors that are independent of $\omega$, hence L is a convex function, and the batch gradient descent is also guaranteed to converge. In practice, we can utilize several indexing techniques to significantly speed up the similarity and influence scores calculation.

6 Experiments and Results

6.1 Setup

Dataset. There is no standard benchmark for our problem, since available datasets on microblog annotation (such as the Microposts challenge (Basave et al., 2014)) do not have global statistics, so we cannot identify the trending hashtags. Therefore, we created our own dataset. We used the Twitter API to collect from the public stream a sample of 500,551,041 tweets from January to April 2014. We removed hashtags that were adopted by fewer than 500 users, had no letters, or had characters repeated more than 4 times (e.g., '#oooommgg'). We identified trending hashtags by computing the daily time series of hashtag tweet counts, and removing those of which the time series' variance score is less than 900. To identify the hashtag burst time period T, we compute the outlier fraction (Lehmann et al., 2012) for each hashtag h and day t: $p_t(h) = \frac{|n_t - n_b|}{\max(n_b, n_{min})}$, where $n_t$ is the number of tweets containing h, $n_b$ is the median value of $n_t$ over all points in a 2-month time window centered on t, and $n_{min} = 10$ is the threshold to filter low activity hashtags. The hashtag is skipped if its highest outlier fraction score is less than 15. Finally, we define the burst time period of a trending hashtag as the time window of size w, centered at day $t_0$ with the highest $p_{t_0}(h)$.

For the Wikipedia datasets we process the dump from 3rd May 2014, so as to cover all events in the Twitter dataset. We have developed Hedera (Tran and Nguyen, 2014), a scalable tool for processing the Wikipedia revision history dataset based on the MapReduce paradigm. In addition, we download the Wikipedia page view dataset that stores how many times a Wikipedia article was requested on an hourly level. We process the dataset for the four months of our study and use Hedera to accumulate all view counts of redirects to the actual articles.

  Total Tweets                    500,551,041
  Trending Hashtags               2,444
  Test Hashtags                   30
  Test Tweets                     352,394
  Distinct Mentions               145,941
  Test (Entity, Hashtag) pairs    6,965
  Candidates per Hashtag (avg.)   50
  Extended Candidates (avg.)      182

Table 1: Statistics of the dataset.

Sampling. From the trending hashtags, we sample 30 distinct hashtags for evaluation. Since our study focuses on trending hashtags that are mappable to entities in Wikipedia, the sampling must cover a sufficient number of "popular" topics that are seen in Wikipedia, and at the same time cover rare topics in the long tail. To do this, we apply several heuristics in the sampling. First, we only consider hashtags where the lexicon-based linking (Section 3.1) results in at least 20 different entities. Second, we randomly choose hashtags to cover different types of topics (long-running events, breaking events, endogenous hashtags). Instead of inspecting all hashtags in our corpus, we follow Lehmann et al. (2012) and calculate the fraction of tweets published before, during and after the peak. The hashtags are then clustered in this 3-dimensional vector space. Each cluster suggests a group of hashtags with distinct semantics (Lehmann et al., 2012). We then pick hashtags randomly from each cluster, resulting in 200 hashtags in total. From this rough sample, three inspectors carefully checked the tweets and chose 30 hashtags where the meanings and hashtag types were certain to the knowledge of the inspectors.

Parameter Settings. We initialize the similarity weights to 1/3, the damping factor to τ = 0.85, and the weight for the language model to λ = 0.9. The learning rate µ is empirically fixed to µ = 0.003.

          Tagme   Wikiminer   Meij    Kauri   M       C       T       IPL
  P@5     0.284   0.253       0.500   0.305   0.453   0.263   0.474   0.642
  P@15    0.253   0.147       0.670   0.319   0.312   0.245   0.378   0.495
  MAP     0.148   0.096       0.375   0.162   0.211   0.140   0.291   0.439

Table 2: Experimental results on the sampled trending hashtags.

Baseline. We compare IPL with other entity annotation methods. Our first group of baselines includes entity linking systems in domains of general text, Wikiminer (Milne and Witten, 2008), and short text, Tagme (Ferragina and Scaiella, 2012). For each method, we use the default parameter settings, apply them for the individual tweets, and take the average of the annotation confidence scores as the prominence ranking function. The second group of baselines includes systems specifically designed for microblogs. For the content-based methods, we compare against Meij et al. (2012), which uses a supervised method to rank entities with respect to tweets. We train the model using the same training data as in the original paper. For the graph-based method, we compare against KAURI (Shen et al., 2013), a method which uses user interest propagation to optimize the entity linking scores. To tune the parameters, we pick four hashtags from different clusters, randomly sample 50 tweets for each, and manually annotate the tweets. For all baselines, we obtained the implementation from the authors. The exception is the Meij method, which we implemented ourselves, clarifying several settings with the authors via email. In addition, we also compare three variants of our method, using only local functions for entity ranking (referred to as M, C, and T for mention, context, and time, respectively).

Evaluation. In total, there are 6,965 entity-hashtag pairs returned by all systems. We employ five volunteers to evaluate the pairs in the range from 0 to 2, where 0 means the entity is noisy or obviously unrelated, 2 means the entity is strongly tied to the topic of the hashtag, and 1 means that although the entity and hashtag might share some common contexts, they are not involved in a direct relationship (for instance, the entity is a too general concept such as Ice hockey, as in the case illustrated in Figure 2). The annotators were advised to use search engines, the Twitter search box or Wikipedia archives whenever applicable to get more background on the stories. Inter-annotator agreement, measured by Fleiss' kappa, is 0.625.

6.2 Results and Discussion

Table 2 shows the performance comparison of the methods using the standard metrics for a ranking system (precision at 5 and 15 and MAP at 15). In general, all baselines perform worse than reported in the literature, confirming the higher complexity of the hashtag annotation task as compared to traditional tasks. Interestingly enough, using our local similarities already produces better results than Tagme and Wikiminer. The local model $f_m$ significantly outperforms both the baselines in all metrics. Combining the similarities improves the performance even more significantly [2]. Compared to the baselines, IPL improves the performance by 17-28%. The time similarity achieves the highest result compared to other content-based mention and context similarities. This supports our assumption that lexical matching is not always the best strategy to link entities in tweets. The time series-based metric incurs lower cost than others, yet it produces a considerably good performance. Context similarity based on Wikipedia edits does not yield much improvement. This can be explained in two ways. First, information in Wikipedia is largely biased to popular entities, so it fails to capture many entities in the long tail. Second, language models are dependent on direct word representations, which are different between Twitter and Wikipedia. This is another advantage of non-content measures such as $f_t$.

[2] All significance tests are done against both Tagme and Wikiminer, with a p-value < 0.01.

For the second group of baselines (Kauri and Meij), we also observe the reduction in precision, especially for Kauri. This is because the method relies on the coherence of user interests within a group of tweets to be able to perform well, which does not hold in the context of hashtags. One astonishing result is that Meij performs better than IPL in terms of P@15. However, it performs worse in terms of MAP and P@5, suggesting that most of the correctly identified entities are ranked lower in the list. This is reasonable, as Meij attempts to optimize (with human supervision effort) the semantic agreement between entities and information found in the tweets, instead of ranking their prominence as in our work. To investigate this case further, we re-examined the hashtags and divided them by their semantics, as to whether the hashtags are spurious trends of memes inside social media (endogenous, e.g., "#stopasian2014"), or whether they reflect external events (exogenous, e.g., "#mh370"). The performance of the methods in terms of MAP scores is shown in Figure 3. It can be clearly seen that entity linking methods perform well in the endogenous group, but then deteriorate [...]

[Figure 3 is a bar chart with one pair of bars (Endogenous, Exogenous) for each of Tagme, WM, Meij, Kauri, M, C, T, and IPL.]
Figure 3: Performance of the methods for different types of trending hashtags.

[Figure 4 is a line chart of MAP (y-axis) against the burst time period window size w in days (x-axis) for Kauri, Wikiminer, Tagme, and IPL.]
Figure 4: IPL compared to other baselines on different sizes of the burst time window T.

[...] window of 2 months (where the hashtag time series is constructed and a trending time is identified), our method is stable and always outperforms the baselines by a large margin. Even when the trending hashtag has been saturated, hence introducing more noise, our method is still able to identify the prominent entities with high quality.

7 Conclusion and Future Work

In this work, we address the new problem of topically annotating a trending hashtag using Wikipedia entities, which has many important applications in social media analysis. We study Wikipedia temporal resources and find that using efficient time series-based measures can complement content-based methods well in the domain of Twitter. We propose to use similarity measures to model both the local mention-based, as well as the global context- and time-based prominence of entities. We propose a novel strategy of topical annotation of texts using an influence maximization approach and design an efficient learning algorithm to automatically unify the similarities without the need of human involvement. The experiments show that our method significantly outperforms the established baselines.

As future work, we aim to improve the efficiency of our entire workflow, such that the annotation can become an end-to-end service. We also aim to improve the context similarity between entities and the topic, for example by using a deeper distributional semantics-based method, instead of language models as in our current work. In addition, we plan to extend the annotation framework to other types of trending topics, by including the type of out-of-knowledge entities. Finally, we are investigating how to apply more advanced influence maximization methods.
We believe that in- in the exogenous group. The explanation is that fluencemaximizationhasagreatpotentialinNLP for endogenous hashtags, the topical consonance research, beyond the scope of annotation for mi- between tweets is very low, thus most of the as- crobloggingtopics. sessments become just verifying general concepts (such as locations) In this case, topical annotation Acknowledgments is trumped by conceptual annotation. However, whenever the hashtag evolves into a meaningful This work was funded by the European Commis- topic, a deeper annotation method will produce a sion in the FP7 project ForgetIT (600826) and the significantimprovement,asseeninFigure3. ERC advanced grant ALEXANDRIA (339233), Finally,westudytheimpactofthebursttimepe- and by the German Federal Ministry of Educa- riodontheannotationquality. Forthis,weexpand tion and Research for the project “Gute Arbeit” the window size w (cf. Section 6.1) and examine (01UG1249C). We thank the reviewers for the how different methods perform. The result is de- fruitfuldiscussionandClaudiaNiedereefromL3S pictedinFigure4. Itisobviousthatwithinthewin- forsuggestionsonimprovingSection5. References [Liuetal.2013] X.Liu,Y.Li,H.Wu,M.Zhou,F.Wei, andY.Lu. 2013. Entitylinkingfortweets. InACL, [Bansaletal.2015] P.Bansal, R.Bansal, andV.Varma. pages1304–1311. 2015. Towards deep semantic analysis of hashtags. InECIR,pages453–464. [Liuetal.2014] Q. Liu, B. Xiang, E. Chen, H. Xiong, F.Tang,andJ.X.Yu. 2014. Influencemaximization [Basaveetal.2014] A. E. Cano Basave, G. Rizzo, over large-scale social networks: A bounded linear A. Varga, M. Rowe, M. Stankovic, and A. Dadzie. approach. InCIKM,pages171–180. 2014. Making sense of microposts (#microp- osts2014) named entity extraction & linking chal- [Meijetal.2012] E.Meij,W.Weerkamp,andM.deRi- lenge. In4thWorkshoponMakingSenseofMicrop- jke. 2012. Addingsemanticstomicroblogposts. In osts. WSDM,pages563–572. [Cassidyetal.2012] T. Cassidy, H. Ji, L.-A. 
Ratinov, [MilneandWitten2008] D. Milne and I. H. Witten. A.Zubiaga,andH.Huang. 2012. Analysisanden- 2008. Learning to link with Wikipedia. In CIKM, hancement of wikification for microblogs with con- pages509–518. textexpansion. InCOLING,pages441–456. [Naamanetal.2011] M. Naaman, H. Becker, and L. Gravano. 2011. Hip and trendy: Characterizing [CiglanandNørva˚g2010] M. Ciglan and K. Nørva˚g. emergingtrendsonTwitter. JASIST,62(5):902–918. 2010. WikiPop:personalizedeventdetectionsystem based on Wikipedia page view statistics. In CIKM, [Osborneetal.2012] M. Osborne, S. Petrovic, R. Mc- pages1931–1932. Creadie,C.Macdonald,andI.Ounis. 2012. Bieber no more: First story detection using Twitter and [FangandChang2014] Y. Fang and M.-W. Chang. Wikipedia. InWorkshoponTime-awareInformation 2014. Entitylinkingonmicroblogswithspatialand Access. temporalsignals. Trans.oftheAssoc.forComp.Lin- guistics,2:259–272. [Owoputietal.2013] O. Owoputi, B. O’Connor, C. Dyer, K. Gimpel, N. Schneider, and N. A. [FerraginaandScaiella2012] P. Ferragina and Smith. 2013. Improved part-of-speech tagging for U. Scaiella. 2012. Fast and accurate annota- online conversational text with word clusters. In tion of short texts with Wikipedia pages. IEEE NAACL-HLT,pages380–390. Softw.,29(1):70–75. [Radinskyetal.2011] K. Radinsky, E. Agichtein, [Fogarasetal.2005] D.Fogaras,B.Ra´cz,K.Csaloga´ny, E.Gabrilovich,andS.Markovitch. 2011. Awordat andT.Sarlo´s. 2005. Towardsscalingfullypersonal- atime: Computingwordrelatednessusingtemporal ized PageRank: Algorithms, lower bounds, and ex- semanticanalysis. InWWW,pages337–346. periments. InternetMathematics,2(3):333–358. [Ratinovetal.2011] L. Ratinov, D. Roth, D. Downey, [Guoetal.2013] S.Guo,M.-W.Chang,andE.Kıcıman. and M. Anderson. 2011. Local and Global Algo- 2013. Tolinkornottolink? Astudyonend-to-end rithms for Disambiguation to Wikipedia. In ACL, tweet entity linking. In NAACL-HLT, pages 1020– pages1375–1384. 1030. 
[Ritteretal.2011] AlanRitter,SamClark,OrenEtzioni, etal. 2011. Namedentityrecognitionintweets: an [Haveliwala2002] T. H. Haveliwala. 2002. Topic- experimentalstudy. InEMNLP,pages1524–1534. sensitivePageRank. InWWW,pages517–526. [Shenetal.2013] W. Shen, J. Wang, P. Luo, and [Keeganetal.2011] Brian Keegan, Darren Gergle, and M. Wang. 2013. Linking named entities in tweets NoshirContractor. 2011. Hotoffthewiki: Dynam- with knowledge base via user interest modeling. In ics,practices,andstructuresinwikipedia’scoverage WSDM,pages68–76. ofthetohokucatastrophes. InWikiSym,pages105– 113. [Tolomeietal.2013] G.Tolomei,S.Orlando,D.Cecca- relli, and C. Lucchese. 2013. Twitter anticipates [Kempeetal.2003] D.Kempe,J.Kleinberg,andE´.Tar- bursts of requests for Wikipedia articles. In Work- dos. 2003. Maximizing the spread of influence shoponData-drivenUserBehavioralModellingand throughasocialnetwork. InKDD,pages137–146. MiningfromSocialMedia,pages5–8. [Lappasetal.2009] T. Lappas, B. Arai, M. Platakis, [TranandNguyen2014] T. Tran and T. Ngoc Nguyen. D. Kotsakos, and D. Gunopulos. 2009. On 2014. Hedera: Scalableindexing,exploringentities burstiness-awaresearchfordocumentsequences. In inWikipediarevisionhistory. InISWC,pages297– KDD,pages477–486. 300. [Lehmannetal.2012] J. Lehmann, B. Gonc¸alves, J. J. [Tranetal.2014] T. Tran, M. Georgescu, X. Zhu, and Ramasco,andC.Cattuto. 2012. Dynamicalclasses N. Kanhabua. 2014. Analysing the duration of of collective attention in Twitter. In WWW, pages trendingtopicsinTwitterusingWikipedia. InConf. 251–260. onWebScience,pages251–252.
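As a supplementary illustration of the ranking metrics used in the evaluation (precision at 5 and 15, and MAP at 15), the following is a minimal sketch rather than the evaluation code behind the reported numbers. It rests on assumptions that the paper does not state: the graded labels (0, 1, 2) are binarized with a hypothetical threshold of 2, and average precision is normalized by the number of relevant entities found within the top k. All function names are illustrative.

```python
from typing import List, Sequence


def binarize(labels: Sequence[int], threshold: int = 2) -> List[bool]:
    """Map graded labels (0, 1, 2) to relevant / not relevant.

    The threshold is an assumption; the paper does not state one.
    """
    return [label >= threshold for label in labels]


def precision_at_k(relevant: Sequence[bool], k: int) -> float:
    """Fraction of the top-k ranked entities that are relevant.

    Rankings shorter than k are treated as padded with non-relevant entities.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevant[:k]) / k


def average_precision_at_k(relevant: Sequence[bool], k: int) -> float:
    """Mean of the precision values at each relevant rank within the top k.

    Normalizing by the hits inside the top k is one common convention
    (an assumption here); others divide by the total number of relevant items.
    """
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevant[:k], start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings: Sequence[Sequence[bool]], k: int) -> float:
    """MAP at k: average of the per-hashtag average precision values."""
    return sum(average_precision_at_k(r, k) for r in rankings) / len(rankings)


if __name__ == "__main__":
    # Hypothetical annotator labels for the ranked entities of one hashtag.
    ranked = binarize([2, 0, 2, 1, 0])
    print(precision_at_k(ranked, 5))
    print(average_precision_at_k(ranked, 15))
```

Under this sketch, a method that places its correct entities lower in the list can keep a high P@15 while scoring a lower MAP, which is the pattern discussed for Meij in Section 6.2.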