ebook img

Event Detection in Social Streams - Charu C. Aggarwal PDF

12 Pages·2012·0.16 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Event Detection in Social Streams - Charu C. Aggarwal

Event Detection in Social Streams CharuC.Aggarwal∗ KarthikSubbian† Abstract social network. We use the term social network rather loosely, as it could refer to the messages in an online chat Social networks generate a large amount of text content over messengerservice,oritcouldevenrefertoanemailnetwork timebecauseofcontinuousinteractionbetweenparticipants. The in which messages are sent between pairs of actors. A miningofsuchsocialstreamsismorechallengingthantraditional number of interesting issues arise in such social networks, text streams, because of the presence of both text content and because they are dynamic, and are associated with network implicitnetworkstructurewithinthestream.Theproblemofevent structure in the stream. Specifically, each actor is a node detection is also closely related to clustering, because the events in the social network, and each message sent in the social canonlybeinferredfromaggregatetrendchangesinthestream.In network is the text content associated with an edge in the thispaper,wewillstudythetworelatedproblemsofclusteringand social network. Clearly, multiple messages can be sent eventdetectioninsocialstreams.Wewillstudyboththesupervised between the same pair of actors over time. In this case, andunsupervisedcasefortheeventdetectionproblem.Wepresent we would like to use the topical content of the documents, experimental results illustrating the effectiveness of incorporating theirtemporaldistribution,andthegraphicalstructureofthe network structure in event discovery over purely content-based dynamicnetworkofinteractionsinordertodetectinteresting methods. eventsandtheirevolution. Clearly,messageswhicharesent 1 Introduction betweenatightlyknitgroupofactorsmaybemoreindicative ofaparticulareventofsocialinterest,thanasetofmessages In this paper, we will study the problem of clustering and which are more diffusely related from a structural point of event detection in social streams. Much of the text data in view. Suchmessagesarestructurallywellconnected, when social scenarios arises in the context of streaming applica- the social network is viewed as a graph, with the edges tions, inwhichthetextarrivesasacontinuous andmassive corresponding to the messages sent between entities. This streamoftextsegments[2]. Suchapplicationspresentaspe- is related to the problem of community detection [3], in cialchallengeforminingalgorithms,becauseofthefactthat which we try to find structurally connected regions of the itisoftennecessarytoprocessthedatainasinglepassand social network. At the same time, the content and topics onecannotstoreallthedataondiskforre-processing. ofthedocumentsshouldalsoplayastrongroleintheevent The online event detection problem is closely related detectionprocess. Thus,bothnetworklocalityandstructure to that of topic detection and tracking [4, 5, 7, 14, 17, 22, needtobeleveragedinadynamicstreamingscenarioforthe 23, 24]. This problem is also closely related to stream eventdetectionprocess. clustering, and attempts to determine new topical trends in Therefore, the key challenges for event detection in the text stream and their significant evolution. The idea is socialstreamsareasfollows: (a)Theabilitytouseboththe that important and newsworthy events in real life are often content and the (graphical) structure of the interactions for captured in the form of temporal bursts of closely related eventdetection. (b)Theabilitytousetemporalinformation messages in a social stream. The problem can be proposed intheeventdetectionprocess. Forexample, anewtrendof in both the supervised and unsupervised scenarios. In the closelyrelatedtextdocumentsfromastructuralandcontent unsupervised case, it is assumed that no training data is pointofview,whichhavenotbeenencounteredearliermay available in order to direct the event detection process for correspond to a new event in the stream. (c) The ability to thestream. Inthesupervisedcase,priordataabouteventsis handle very large and massive volumes of text documents availableinordertoguidetheeventdetectionprocess. undertheone-passconstraintofstreamingscenarios. In this paper, we will study the problems of clustering This paper is organized as follows. The remainder and event detection in social streams in which each text of this section presents related work. In section 2, we message is associated with at least a pair of actors in the present the model for event mining in social streams. The algorithms for supervised and unsupervised event detection ∗IBM T. J. Watson Research Center, Hawthorne, NY, USA, are presented in section 3. In section 4, we present the [email protected] experimentalresults. Section5presentstheconclusionsand †DepartmentofComputerScience&Engineering,UniversityofMin- summary. nesota,TwinCities,MN,USA,[email protected] 1.1 Related Work The problemofdetermining events in sponds to the content of the interaction of an entity in streamsiscloselyrelatedtotheproblemofstreamclustering thesocialnetworkwithoneormoreotherentities. [2]. This method has been studied extensively by the text miningcommunityinthecontextofthetopicdetectionand • The object Si contains the origination node qi ∈ N tracking problem [4, 5, 7, 14, 17, 22, 23, 24]. This is whichisthesenderofthemessageTitoothernodes. alsorelatedtotheproblemofclusteringandtopicmodeling • The object Si contains a set of one or more receiver [6,22,13,16,21,27]indynamictextstreams. nodes Ri ⊆ N, which correspond to all recipients of However, in social networks, there is a rich amount of the message Ti from node qi. Thus, the message Ti is structure available in determining the key events in the net- sentfromtheoriginationnodeqi toeachnoder ∈ Ri. work. Forexample,aneventcorrespondingtoMideastUn- Itisassumedthateachedge(qi,r)belongstothesetA. rest may often correspond to text streams exchanged be- tweenmemberswhoarecloselylinkedtooneanotherbased Thus,theobjectSiisrepresentedbythetuple(qi,Ri,Ti). ongeographicalproximity. Whiletheuseoflinkagesinor- Wenotethattheabovedefinitionofasocialstreamcaptures dertodetermineclustersandpatternshasbeenwidelystud- anumberofdifferentnaturalscenariosindifferentkindsof ied by the social networking community [8, 9, ?, 11, 25], socialnetworks: these methods are typically designed for static networks. Some clustering methods have recently also been designed • In the Twitter social network, the document Ti is the fordynamicnetworks[8,9],thoughtheydonotusethecon- tweetcontent,thenodeqi isthetweetingactor,andthe tent of the underlying network for the mining process. On setRiistherecipientset. the other hand, some recent methods for pattern discovery innetworks use both content and structure [25,28], though • Email and chat interaction networks may also be con- thesemethodsarenotdefinedfortheproblemofeventdetec- sidered social networks with an exactly similar inter- tioninthetemporalscenario. Amethodin[15]isdesigned pretationtotheabove. tomeasurethediffusionandspreadcharacteristicsofknown • In many social networks, a posting on the wall of one and popular events, but is not designed for new event dis- actortoanothercorrespondstoanedgewiththedocu- covery. In this paper, we design a method which can use mentTicorrespondingtothecontentoftheposting. thecontent,structuralandtemporalinformationinaholistic way in order to detect relevant clusters and events in social Such a social stream typically contains rich information streams. aboutthetrendswhichmayleadtochangesbothinthecon- tent and the structural locality of the network in which the 2 EventMininginSocialStreams: TheModel interactions may occur. We begin with describing an unsu- Inthissection,weintroducethenotationsanddefinitionsand pervised technique for event detection, which continuously the model for event mining in social streams. Such social characterizes the incoming interactions in the form of clus- streamsareassumedtoconsistofcontent-basedinteractions ters,andleveragestheminordertoreporteventsinthedata between structurally connected entities in the data. There- stream.Weformallydefinethesocialstreamclusteringprob- fore, wewillproposeanumberofnotationsanddefinitions lem. for this purpose. We assume that the structure of the social DEFINITION2.(SOCIALSTREAMCLUSTERING) A social nisetdweonrokteidsdbeynNoteadnbdyetdhgeegsreatphisGde=not(eNd,bAy).AT.hTehneodsoecsieatl rsetrnetacmlusSt1er.s.C.S1r.....C.kis,scuocnhtitnhuaot:usly partitioned into k cur- streamcorrespondstotheinteractionsbetweenthedifferent actorsinthenodeN,andeachinteractionisanedgedrawn • Each object Si belongs to at most one of the current fromthelinkagesetA. Therefore,thegraphGprovidesin- clustersCr. formationabouttheuniverseofinteractionsinthesocialnet- • The objects are assigned to the different clusters with work. Wenextdefinetheconceptofasocialstreamwhichis theuseofasimilarityfunctionwhichcapturesboththe overlaidonthissocialnetworkstructure. contentoftheinterchangedmessages,andthedynamic DEFINITION1.(SOCIALSTREAM) A social stream is a social network structure implied by the different mes- continuous and temporal sequence of objects S1...Sr..., sages. suchthateachobjectSi correspondstoacontent-basedin- We will use both the content and linkage information in teractionbetween socialentities, andcontainsexplicitcon- order to create the clusters. Since the clusters are created tentinformationandlinkageinformationbetweenentitiesas dynamically,theymaychangeconsiderablyovertime,asthe follows: stream evolves, and new points are added to the clusters. • TheobjectSicontainsatextdocumentTiwhichcorre- Furthermore, in some cases, an incoming object may be different enough from the current clusters. In that case, it AlgorithmSocialStreamClustering(NumClusters:k); begin may be put into a cluster of its own, and one of the current clusters may be removed from the set C1...Ck. Such an IInniittiiaalliizzeeci,luμs,teσr,sMC10.,.M.C1r,Mto2nutoll;0; event may be an interesting one, especially if the newly repeat created cluster starts a new pattern of activity in which i=i+1; more stream objects are subsequently added. At the same ReceivenextsocialstreamobjectSi; time, in some cases, the events may not be entirely new, foreachclusterClcomputeSim(Si,Cl); LetrbetheindexofclusterCrwithlargestsimilaritytoSi; butmaycorrespondtosignificantchangesinthepatternsof if(Sim(Si,Cr)<μ−3·σ)thenreplacemoststale the arriving objects in terms of their relative distribution to clusterwithanewclustercontainingthesinglepointSi; clusters.Therefore,wedefinetwokindsofevents,whichare elseaddSitoCrandupdatestatisticsψr(Cr)ofclusterCr; referred to as novel events and evolution events in order to UpdateM0,M1,M2additively; describethesedifferentscenarios. μ=M(cid:2)1/M0; σ= M2/M0−μ2; until(endofstream); DEFINITION3.(NOVELEVENT) The arrival of a data end point Si is said to be a novel event if it is placed as a sin- glepointwithinanewlycreatedclusterCi. Figure1: SocialStreamClusteringforEventDetection WedenotethecreationtimeforclusterCibyt(Ci).Theevent inthiscaseisthestoryortopicunderlyingthedatapointSi and not the data point itself. We will discuss the process words,wehave: for creation and maintenance of a cluster in a social stream F(t −H,t ,C ) slightly later. Next, we define the concept of an evolution (2.1) c c i ≥α F(t(C ),t −H,C ) event. Anevolutioneventisdefinedwithrespecttospecific i c i timehorizonandrepresentsachangeintherelativeactivity Furthermore,itisassumedthattc−2·H ≥t(Ci). for that particular cluster. We first define the concept of fractionalclusterpresence. Weassumethatthevalueoftc−2·Hislargerthanthecluster creation time t(Ci) in order to define the afore-mentioned DEFINITION4.(FRACTIONALCLUSTERPRESENCE) evolutionratioinastableway. ThisensuresthatatleastH The fractional cluster presence for cluster Ci in the time unitsoftimeareusedinthecomputationofthedenominator period (t1,t2) is the fraction of records from the social ofthisratio. stream arriving during time period (t1,t2), which belong to the cluster Ci. This fractional presence is denoted by 3 SocialStreamClustering F(t1,t2,Ci). Thedesignofaneffectiveonlineclusteringalgorithmisthe key to the event detection process. Therefore, we will first The occurrence of a new event typically affects the focus on the problem of online clustering. Then, we will relative presence of the data points in the different clusters, discuss how to leverage this clustering for event detection. or it may result in a novel event. For example, the Mideast Weassumethattheclusteringalgorithmusesthenumberof Unresteventmayresultineitherthecreationofanewcluster clustersk asinput, andmaintainsthestructuralandcontent orthesignificantadditionofnewdatapointstotheclusters information in the underlying clusters in the form of node mostcloselyrelatedtothistopic. Thisisbecauseitisoften and word frequencies in the cluster. The clusters are de- possible for previously existing clusters to match closely with a sudden burst of objects related to a particular topic. noted by C1...Ck. We assume that the set of nodes asso- Thissuddenburstischaracterizedbyachangeinfractional ciated with the cluster Ci is denoted by Vi, and the set of presenceofdatapointsinclusters. Formally,wedefinesuch words associated with it is denoted by Wi. The set Vi is an event as an Evolution Event. In order to determine such referred to as the node summary, whereas the set Wi is re- ferred to as the word summary. As we will see later, this evolutionevents,wedeterminethehigherrateatwhichdata summarycharacterizationcanbeusedforassignmentofin- pointspointshavearrivedtothisclusterintheprevioustime windowoflengthH,ascomparedtothethatevenbeforeit. comingsocialstreamobjectstoclusters. ThesetVicontains Aparameterαisusedinordertomeasurethisevolutionrate. thenodesji1,ji2...jis, togetherwithnodefrequenciesde- noted by νi1...νis. The word set Wi contains the word DEFINITION5.(EVOLUTIONEVENT) An evolution event identifiers li1,li2,...lis together with word frequencies de- overhorizonH atcurrenttimetcissaidtohaveoccurredat noted by φi1,φi2...φis. One challenge with maintaining thresholdαforclusterCi,iftheratiooftherelativepresence nodesummariesinlargesocialnetworksisthatthenumber of points in cluster Ci over the horizon (tc −H,tc) to that ofnodescanbeextremelylarge(inthehundredsofmillions), beforetimetc−H isgreaterthanthethresholdα. Inother as a result of which the summary may be too large to use efficiently, especially in the online context. Later, we will node qi and the destination nodes Ri. In other words, the discuss how to use sketch-based methods in order to com- set Ri ∪{qi} is used to update the set Vr and its member putethesimilaritiesmoreeffectively. First,wewilldiscussa frequencies. The same approach is applied to the words in more straightforward method for online maintenance of the Wr with the use of the words in the social stream object clusterswiththedirectuseofthecluster-summary. Now,we Si. The only difference in this case is that the frequencies willformallydefinetheconceptofcluster-summary. of the words are not incremented only by 1, but by their frequency of presence in the underlying document. On the DEFINITION6. Thecluster-summaryψi(Ci)ofclusterCi is other hand, if the similarity of Si to Cr is greater than definedasfollows: μ−3·σ,thenwecreateasingletonclustercontainingonly • It contains the node-summary, which is a set of nodes the object Si and corresponding cluster summary statistics. ηsViiin==odνe{i1sj.i.1.,.jνi2is.i...Tjhisein}otdoegestehteVriwisitahssthuemiredfrteoqcuoenntcaieins oTconhleilsewccthliuoicsntheCrw1ra.es.p.luaCpckde.astTethhdeethmmeoolssettassstttaalrleeecccellnuutsslttyee.rrIfinrsodtmheefitnheeevdecnuatsrrtethhnaett a null cluster exists (which has never been updated), it is • Itcontainsthecontent-summary,whichisasetofword automaticallyconsideredthemoststalecluster. Anytiesare identifiers Wi = {li1,li2,...liui} together with their brokenrandomly. correspondingwordfrequenciesΦi =φi1,φi2...φiui. It remains to explain how the similarity of the stream The content-summary Wi is assumed to contain ui objectSi istocomputedtotheclusterCr. Inordertocom- words. pute the similarity, we need to compute both the structural SimS(Si,Cr)) and the content components SimC(Si,Cr) Theoverallsummaryisψi(Ci)=(Vi,ηi,Wi,Φi). ofthesimilarityvalue. Thecontent-componentsofthesim- ilarityisstraightforward,andissimplythetf-idfbased[18] Wedesignanonlinepartition-basedclusteringmethod- goelothgeyr,iwniwthhtihcheiracsleutsotefrclsuusmtemrsaCri1es..ψ.1C(kCa1r)e.m..aψinkt(aCikn)e.dAtos- cosiobmmjeipclaturtiSetyit)hbaeentwsdtreutehcnetuthcraeolncstoeinmntteilnWatriTrty.i(bbNeetelwoxnte,genwinetghtedoinsscooudcseisaslhVsotrrweaantmdo new social stream objects arrive, the clusters are continu- the nodes Ri ∪ {qi} in the social stream. Let B(Si) = ouslyupdated. Atthe sametime, thechanges intheunder- lying clusters are continuously tracked and used in order to (b1,b2,...bsr)bethebit-vectorrepresentationofRi∪{qi}, whichhasabitforeachnodeinVr,andinthesameorderas raise alarms for new events. First, we will discuss how the clustersaremaintained. Wewilldiscusstheprocessofcon- thefrequency vectorη = (νr1,νr2,...νrsr)ofVr. Thebit valueis1,ifthecorrespondingnodeisincludedinRi∪{qi} structingeventalarmsfromclusterstatisticsslightlylater. and otherwise it is 0. The structural similarity between the In each iteration, we compute the similarity of the in- coming social stream object Si to each cluster summary objectSi andthefrequency-weightednodesetofclusterCr isdefinedasfollows: ψi(Ci).Thedetailsofthesimilaritycomputationwillbepro- (cid:2) videdlater. Theincomingstreamobjectisthenassignedto sr b ·ν its closest cluster, unless the closest similarity value is sig- (3.2) SimS(Si,Cr)= (cid:3)||R ∪{tq=1}||t·((cid:2)rtsr ν ) i i t=1 rt nificantlylowerthanthatattainedforthestreamobjectsen- counteredsofar. Inordertodetermineiftheclosestsimilar- Note that we are using the L1-norm of the node-frequency ityvalueissignificantlylowerthanthatattainedbyprevious vectorinthedenominatorasopposedtothenormaluseofthe stream objects, we maintain the mean μ and standard devi- L2-norm in order to penalize the creation of clusters which ation σ of all closest similarity values of incoming stream aretoolarge. Thiswillresultinmorebalancedclusters. objectstoclustersummaries. Thesimilarityvalueissaidto TheoverallsimilaritySim(Si,Cr)iscomputedasalin- besignificantlybelowthethreshold,ifitislessthanμ−3·σ. earcombinationofthestructuralandcontent-basedsimilar- We will describe the process of dynamically maintaining μ ityvalues. andσatalaterstageofthepaper. (3.3) Ifthesimilarityvalueoftheclosestclusterisabovethe Sim(S ,C )=λ·SimS(S ,C )+(1−λ)·SimC(S ,C ) i r i r i r threshold of μ − 3 · σ, then we can assign the incoming stream object Si to its closest cluster centroid. Once the The parameter λ is the balancing parameter, and lies in the stream object Si is assigned to its closest cluster centroid range(0,1). Thisparameterisspecifiedbytheuser. Cr, we update the corresponding cluster summary ψr(Cr). It remains to explain how we maintain the mean and Specifically, any new nodes in Si, which are not already standard deviation of the (closest) similarity values of the included in Vr are added to Vr, and the frequency of the incoming objects to the clusters. For this purpose, we nodesofSi whichareincludedinVr areincrementedby1. maintain the zeroth, first and second order moments M0, We note that the nodes in Si correspond to both the source M1 and M2 of the closest similarity values continuously. Thesevaluescanbeeasilymaintainedinthestreamscenario, foreachclusterinthedata. Thesketchtableisusedforthe because they can be additively maintained over the data purpose of maintaining the frequency counts of the nodes stream. The mean μ and standard deviation σ can be in the incoming data stream. Specifically, the sketch table expressedintermsofthesemomentsasfollows: for the jth cluster is denoted by Uj. If desired, we can use (cid:3) the same set of w hash functions for the different clusters. μ=M1/M0, σ = M2/M0−(M1/M0)2 Themainconditionisthatthesetofwhashfunctionsshould be independent of one another. For each incoming object The overall algorithm for cluster maintenance is illus- Si, we update the sketch table for the cluster to which it is trated in Figure 1. The main challenge of this algorithm is assignedonthebasisofthesimilaritymeasure.Weapplythe thatthenodebasedstatisticscanberatherlargeandthecor- wdifferenthashfunctionstothe(stringrepresentationofthe respondingsimilaritycomputationscumbersome. Forexam- ple, the number of nodes in Vr may be of the order of tens identifierofthe)nodesinRi∪{qi},andadd1tothecounts of the corresponding cells. Thus, for the incoming object ofmillionsandthiscanmakethealgorithmextremelyslow. Ri,weneedtoapplyeachofhashfunctionstothedifferent Therefore,wewilldesignasketch-basedtechniqueinorder |Ri|+1differentnodes,andupdatethecorrespondingcells. tospeedupthecomputations. This corresponds to an application of (|Ri| + 1) · w hash functioninstantiationsandcorrespondingcellupdates. 3.1 Sketch-based Speedup In this section, we discuss The sketch-based structure can also be used to effec- the sketch-based technique for maintaining node statistics. tively estimate the similarity value SimS(Si,Cr). We note Sketchbasedtechniques[10]areanaturalmethodforcom- that this similarity computation needs to be performed for pressing the counting information in the underlying data so eachclusterCr,anditscorrespondingsketchtableUr inor- thatthebroadcharacteristicsofthedominantcountscanbe der to determine the closest cluster to the incoming object maintained in a space-efficient way. In this paper, we will basedonthecompositesimilaritymeasure. Thedenomina- applythecount-minsketch[10]formaintainingnodecounts tor of SimS(Si,Cr) can be exactly estimated(cid:3), because the intheunderlyingclusters.Inthecount-minsketch,ahashing objectSi isknown,andthereforethevalu(cid:2)eof |Ri∪{qi}| ainppthroeaucnhdiesrulytiilnizgeddaitnaosrtrdeearmto. kWeeeputsreacwko=ft(cid:6)hlen(n1o/dδe)(cid:7)copuanitrs- canalsobeknownexactly. Thevalueof st=r1νrt canalso beknownexactly,becauseitissimplyequaltothesumofall wise independent hash functions, each of which map onto uniformlyrandomintegersintherangeh = [0,e/(cid:10)], where thevaluesinthesketchtablecellsinUr foranyoneofthew hashfunctions. Thus,thisvaluemaybeobtainedexactlyby eisthebaseofthenaturallogarithm.Thedatastructureitself summing up the h cells for any particular1 one of the hash consists of a two dimensional array with w ·h cells with a functions. On the other hand, the numerator needs to esti- lengthofhandwidthofw. Eachhashfunctioncorresponds matedapproximately. Notethatthenumeratorisessentially tooneofw1-dimensionalarrayswithhcellseach. Instan- definedbythesumoftheestimatedvaluesofthefrequencies dardapplicationsofthecount-minsketch,thehashfunctions ofthenodesincludedinRi∪{qi}. are used in order to update the counts of the different cells The frequency of each such node can be estimated in inthis2-dimensionaldatastructure. Forexample,considera a manner discussed in [10]. Specifically, for each node 1-dimensionaldatastreamwithelementsdrawnfromamas- included in Ri ∪ {qi}, the corresponding cluster-specific sive set of domain values. For example, in our application, frequencyofthenodecanbeobtainedbyapplyingthehash thisdomainofvaluescorrespondstothedifferentnodeiden- functiontotheidentifierofeachnode.Wenotethatthevalue tifiersinthesocialnetwork. Whenanewelementofthedata ofthecorrespondinghashcellwillalwaysbeanoverestimate streamisreceived,weapplyeachofthewhashfunctionsto becauseofcollisionsbetweenthedifferentnodeidentifiersto mapontoanumberin[0...h−1]. Thecountofeachofthe thesamehashcell. Therefore,theminimumofthesevalues set of w cells is incremented by 1. In order to estimate the across the w different hash functions will also be an over- count of an item, we determine the set of w cells to which estimate, though it will be much tighter and more robust each of the w hash-functions map, and compute the mini- because of the use of different hash functions. We sum up mumvalueamongallthesecells. Letct bethetruevalueof theseestimatedfrequencyvaluesoverthedifferentnodesin thecountbeingestimated. Wenotethattheestimatedcount isatleastequaltoct,sincewearedealingwithnon-negative Ri∪{qi}. Thisisessentiallytheestimateofthenumerator. Thisestimationofthenumeratorcanbeusedinconjunction countsonly,andtheremaybeanover-estimationbecauseof with our exact knowledge about the different denominator collisions among hash cells. As it turns out, a probabilistic values in order to create an estimate of SimS(Si,Cr). Let upperboundtotheestimatemayalsobedetermined. Ithas beenshownin[10],thatforadatastreamwithT arrivals,the EstSimS(Si,Cr)representtheestimatedsimilarityofSito estimateisatmostct+(cid:10)·T withprobabilityatleast1−δ. In order to use the count-min sketch for improving the 1The sum of the h cells would be the same, no matter which hash node-count estimation process, we maintain a sketch table functionispicked. (cid:2) Cr withtheuseofthesketch-basedapproach. Then,wecan is (|Ri| + 1) · ( st=r1νrt)/h. This is because, we need showthefollowingresult: to sum up the errors over |Ri| + 1 different frequency iLsEuMsMedA, 3th.1e.nIffoar sskoemtceh-stmaballel wvaitlhuele(cid:10)ngt>h h(cid:3)an|Rdiw|+idt1h/hw, eosftitmheasteiocnesl,l-abnadsetdheesetximpeactitoednsniusm(b(cid:2)erst=ro1fνcrotl)l/ishio.nTshfeonr,awnye can use the Markov inequality to show that the probability the estimated value of the similarity EstSimS(Si,Cr) is thattheconditioninEquation3.9isviolatedwiththeuseof boun(cid:4)de√d to th(cid:5)ewfollowing range with probability at least 1hashfunctionwithprobabilityatmost (|Ri|+1)·((cid:2)st=r1νrt). (13−.4) |Rh·i(cid:2)|+1 : Whaeshcfaunngcteinoenrsatloizaettmheosptr(cid:7)ob(|aRbii|l+it1y)·o((cid:2)f tsth=ri1sνrtot)(cid:8)wwi.BnTd·hehpeerenfdoernet, B·h SimS(S ,C )≤EstSimS(S ,C )≤SimS(S ,C )+(cid:10) i r i r i r thec(cid:7)onditioninEquation(cid:8)3.9holdswithprobabilityatleast 1− (|Ri|+1)·((cid:2)st=r1νrt) w. By substituting the value of B Proof. As discussed earlier in Equation 3.2, the structural B·h similarityisgivenbythefollowingequation: intheaboveequation,wegetthedesiredresult. (cid:2) sr b ·ν (3.5) SimS(Si,Cr)= (cid:3)||R ∪{tq=1}||t·((cid:2)rtsr ν ) Theaboveresultsuggeststhatthevalueofthesimilaritycan i i t=1 rt beestimatedquiteaccuratelywiththeuseofmodestmemory requirements of a sketch table. For example, consider a We note that the numerator is (over)-estimated approxi- tweet with |R| ≈ 100, and a similarity estimation bound mately with the use of the sketch-based process, whereas of (cid:10) = 0.001. If we use a sketch table with h = 200,000 the denominator can be known exactly. It is evident that SimS(Si,Cr) ≤ EstSimS(Si,Cr) because of the over- and w = 5 (typical values), it will require a storage of only 1 million cells, which is in the megabyte order. The estimation of the numerator. It remains to show that EstSimS(cid:4)(S√i,Cr) ≤(cid:5)wSimS(Si,Cr)+(cid:10) with probability at wsimithilparriotbyaebsitliimtyaatteleliaesst1w−ith(i1n/2(cid:10)0)=5 >0.010−11o0f−t6h.eIntruperavctailcuee, least1− |Rh·i(cid:2)|+1 . sincethesetheoreticalboundsarequiteloose, theestimates Let SimSN(Si,Cr) and EstSimSN(Si,Cr) be the aremuchbetter. estimationofthenumeratorwiththesketch-basedapproach. 3.2 Event Detection with Clustering The clustering Then, since the denominator can be computed exactly, we methoddiscussedabovecanbeuseddirectlyinordertoper- have: formtheeventdetection. Asdiscussedbefore,theeventde- EstSimSN(S ,C ) (3.6) EstSimS(Si,Cr)= (cid:3)|Ri|+1·((cid:2)sti=r1νrrt) tiescutisoendafolgrotrhitehemveunstedseatetcimtioenhporroizcoenssH. Inasotrhdeerintopuptewrfhoircmh theeventdetection,wemonitortheratio F(tc−H,tc,Ci) for Furthermore,wehave: F(t(Ci),tc−H,Ci) eachclusterCi continuouslyovertime,andtriggeranalarm (3.7) SimS(Si,Cr)= (cid:3)|RSi|m+S1N·((S(cid:2)i,sCrr)ν ) awhsiegnneivfiecratnhtischraatniogeexicnetehdesuthnedtehrrlyesinhgolsdoocfiaαl.sTtrheiasms,ugwgheiscths i t=1 rt isdetectedbyasignificantchangeintheratiosofstreamob- Therefore, in order to prove the result in the lemma we jectsbeingassignedtothedifferentclusters. needtoproveboundsontheapproximationinthenumerator. Specifically,weneedtoprov(cid:4)e√thatthe(cid:5)followingholdstrue 3.2.1 Supervised Event Detection The afore-mentioned w withprobabilityatleast1− |Ri|+1 . approachdiscussesthecaseofunsupervisedeventdetection h·(cid:2) in which we only try to find significant events in the social (3.8) network without any particular regard to their nature. In (cid:3) (cid:6)sr EstSimSN(S ,C )≤SimSN(S ,C )+(cid:10)· |R |+1·( ν p)ractice, one may want to detect known events which have i r i r i rt been encountered earlier in the stream. This is the case of t=1 supervisedeventdetection. Inthiscase, weassumethatwe Let us abbreviate one of the term(cid:3)s on the right(cid:2)hand side have access to the past history of the stream in which the of the above equation by B = (cid:10)· |Ri|+1·( st=r1νrt). event E has been known to have occurred. In addition, we In other words, we need to show the above bound on the haveinformationaboutatleastasubsetofthesocialstream probabilityfortheconditionthat: tweetswhicharerelevanttothisparticularevent. Thisisthe (3.9) EstSimSN(Si,Cr)−SimSN(Si,Cr)≤B groundtruthwhichcanbeleveragedformoreaccurateevent detection. The expected value of the estimation of SimSN(Si,Cr)− In order to perform supervised event detection, we SimSN(Si,Cr)withtheuseofanyparticularhashfunction need to make some changes to the clustering portion of the algorithmaswell. Onemajorchangeisthatwedonotallow TwitterSocialStream:Thisisastreamoftweetswhichwas replacementofoldclustersorcreationofnewclusterswhen crawledfromtheTwitter socialnetwork. Eachsocialobject a new incoming point does not naturally fit in any cluster. contains the network structure and content in a tweet. The Rather, it is always assigned to its closest cluster, with any stream contained a total of 1,628,779 tweets, which were ties broken randomly. This is done in order to provide distributed over a total of 59,192,401 nodes. The nodes stabilitytotheclusteringcharacteristics,andisessentialfor include the sender and the receivers either in the form of characterizingtheeventsinaconsistentwayovertimewith direct mentions from senders or their followers in the case respecttotheclustersintheunderlyingdata. of broadcast messages. On the average, each stream object Therelativedistributionofevent-specificstreamobjects containedabout84nodespertweet. to clusters is used as a signature which is specific to the Enron Email Stream: The well known Enron email data event. These can be used in order to perform the detection set2wasconvertedtoastreamwiththeuseofthetimestamp in real time. The assumption in the supervised case is that information in the emails. Each object contained the text the training data about the social stream objects which are oftheemail,andthenetworkstructurecorrespondingtothe related to the event are available in the historical training senderandreceiver(s).Inthissense,thenetworkstructureof data. ThesignatureoftheeventE isdefinedasfollows. an email is very similar to a tweet with a single sender and multiple receivers. The Enron email data stream contained DEFINITION7.(EVENTSIGNATURE) The event signature a total of 517,432 emails. We eliminated emails that did ofasocialstreamisak-dimensionalvectorV(E)containing not have valid sender and receiver email identifiers. We the (average) relative distribution of event-specific stream also filtered out the calender invites, duplicate emails, and objects to clusters. In other words, the ith component the email history at the bottom of each email. The total of V(E) is the fraction of event-specific (training) stream emails after the filtering process were 349911, which were objectswhichareassignedclusteri. distributedoveratotalof29,083individuals.Ontheaverage, Clearly, the event signature provides a useful characteriza- eachemailcontained3.62receivers. tionoftherelativetopicaldistributionduringaneventofsig- nificance. For example, during a period of mideast unrest 4.2 Evaluation Measures We tested both the clustering (the event E), some clusters are likely to be much more ac- and event detection methods for effectiveness. For the case tivethanothers,andthiscanbecapturedinthevectorV(E), ofefficiency,wetestedonlytheclusteringmethod,because aslongasgroundtruthisavailabletodoso. Theeventsig- the majority of the time for event detection was spent in natures can be compared to horizon signatures, which are the clustering process. In order to test the effectiveness, essentially defined in the same way as the event signatures, we used a number of class labels which were associated exceptthattheyaredefinedoverthemorerecenttimehori- with the objects in the social stream. For the case of the zon(tc−H,tc)oflengthH. Twitter stream, these class labels were the hash tags which wereassociatedwiththetweet. Themostfrequenthashtags DEFINITION8.(HORIZONSIGNATURE) The horizon sig- in the stream often correspond to events in the stream such nature over the last time period (tc − H,tc) is a k- as Uganda protests or the Japan Earthquake, and represent dimensionalvectorcontainingtherelativedistributionofso- meaningful characteristics of the underlying objects which cialstreamobjectstoclusterswhichhavearrivedinthepe- ought to be clustered together. These hash tags also often riod(tc−H,tc). representthemeaningfuleventsinthestream. Wenotethat Inordertoperformthesupervisedeventdetection,wesimply not every tweet may contain hash tags, and therefore, the compute the dot product of the horizon signature with the hash tags were only associated with a subset of the tweets. knowneventsignature(whichwascomputedbytheground For the case of the Enron email stream, the class labels truth),andoutputanalarmlevelwhichisequaltothisvalue. weredefinedbythemostfrequentlyoccurringtokensinthe The tradeoff between false positives and false negatives is subjectline. Thesetokenswereasfollows: determinedbythethresholdchosentodecidewhensuchan meeting,agreement,gas,energy,power,report,update,request,conference, eventhasreallyoccurred. letter, deal, credit, california, trading, contract, project, presentation, houston,announcement 4 ExperimentalResults All emails which contained any of the above tokens in the subject line were tagged with the corresponding class Inthissection,wewillstudytheclusteringandeventdetec- label. Thus, for both data streams, a subset of the stream tionalgorithmsforeffectivenessandefficiency.Weusedtwo objects were tagged with a class label. Clearly, clusters of realdatasetstoevaluatetheeffectivenessofourapproach. high quality would tend to put objects with similar tags in 4.1 Data Sets Our algorithm was tested on the following twodatasets: 2http://www.cs.cmu.edu/˜enron/ thesamecluster.Therefore,wecomputedthedominantclass by setting λ to the extreme values of 0 and 1, we can test purity of each cluster. For each cluster, we determined the how well the algorithm performs, when we use either the tagwiththehighestpresence, andcomputedthefractionof networkonly,orthetextonlyfortheclusteringprocess. We the (tagged) cluster objects, which belonged to that label. alsotestedacombinationschemeinwhichthevalueofλwas This value was averaged over the different clusters in a setto0.5. Thiskindofschemeprovidesequalweighttothe weightedway,wheretheweightofaclusterwasproportional textandcontentintheclusteringandeventdetectionprocess. tothenumberof(tagged)objectsinit. Forthecombinationscheme,wetesteditbothwithandwith- We also tested the efficiency of the social stream clus- out the sketch in order to estimate the effects of sketch use tering process. In order to test efficiency, we computed the ontheschemeintermsofaccuracyandefficiency. Wenote number of social stream objects processed per unit of time, thatthetext-onlyvariationcanbeconsideredquitesimilarto andwepresentedthisnumberforthedifferentalgorithms. thestreamtext-clusteringapproachdiscussedin[1]. There- Wealsoalsotestedtheeffectivenessoftheeventdetec- fore, this variation also serves as a good baseline, because tion algorithm. For the case of the unsupervised algorithm, theuseofpuretextcontentistheonlynaturalalternativefor we will present a case study, which illustrates the interest- thisproblematthisjuncture. ingeventsfoundbythetechnique. Thisprovidesanintuitive ideaabouttheeffectivenessofthetechnique. Forthesuper- 4.4 Effectiveness Results for Clustering First, we will vised algorithm, we first need to generate the ground truth present the effectiveness results of the clustering algorithm forthesupervisedalgorithm. Forthispurpose, weusedthe intermsoftheclusterpurity. Wetestedtheeffectivenessof hashtagsintheTwitterStreaminordertogeneratea0-1bit theapproachwithincreasingnumberofclusters. Theresults stream corresponding to when the events actually occurred. for the Twitter and Enron streams are illustrated in Figure We used the two hash tags corresponding to #japan and 2(a)and(b)respectively. Thesketchtablelengthhwasset #uganda in order to generate the events in our Twitter data to 262,213, whereas the sketch table width w was set to 2 stream. Specifically, at each time stamp, we looked at the for the Twitter social stream, and these values were set to past window of length h and counted the number of occur- 16,369 and 2 for the Enron data stream. In each case, we rencesofaparticularhashtaginthatwindow. Ifthenumber have illustrated the number of clusters on the X-axis, and of occurrences of the hash tag was at least 3, then we gen- theclusterpurityontheY-axis.Averyinterestingtrendwas eratedabitof1atthattimestamptoindicatethattheevent observedforboththedatasetsintermsoftherelativeperfor- hasindeedoccurred. manceofthedifferentalgorithms. Inallcases,thealgorithm We note that our supervised event detection algorithm whichusedonlytextperformedtheworstamongalltheal- generates a continuous alarm level. It is possible to use gorithms. Thetrendsbetweenthepurelynetwork-basedap- a threshold t on this continuous alarm level in order to proach and combined approach were dependent upon the generate a 0-1 bit stream corresponding to the algorithmic level of granularity at which the clustering was performed. predictionofwhentheeventhasoccurred.Byusingdifferent For both data streams, we found that when a small number thresholds t on this alarm level, we can obtain different ofclusterswereused,thenetwork-basedapproachwassupe- tradeoffsbetweenprecisionandrecall. LetSF(t)betheset rior.Whenalargernumberofclusterswereused,thecombi- of time stamps at which an alarm is generated with the use nation methods outperformed the purely network-based ap- of threshold value of t on the real alarm level. Let SG be proach. This is because the network locality information is thegroundtruthsetoftimestampsatwhichtheeventtruly morethansufficienttoeffectivelypartitiontheclusters,when occurs. Then,Precision(t)andRecall(t)canbecomputed the granularity is relatively coarse. In such cases, the addi- asfollows: tion of text does not improve the quality of the underlying |S (t)∩S | clusters, and can even be detrimental to clustering quality. (4.10) Precision(t)= F G However, when the number of clusters increases, the com- |S (t)| F binationapproachtendstoperformbetter,becausethegran- |S (t)∩S | (4.11) Recall(t)= F G ularity of the clusters is much higher, and therefore more |S | G attributes are required to distinguish between the different (4.12) clusters. This is particularly evident in the case of the En- rondatastream,inwhichthegapbetweenthecombination- We presented the tradeoff between the precision and recall approachandpurelynetwork-basedapproachisratherlarge. byvaryingthevalueoftinordertogenerateaplotbetween Inallcases, wefoundthattheuseofsketchesleadtosome thetwo. lossofaccuracy. However, thislossofaccuracyisnotvery significant, especially when we consider the fact that the 4.3 AlgorithmVariationsandBaseline Wetesteddiffer- sketch-basedapproachwassignificantlyfaster. Itisalsoim- entalgorithmsettingstodeterminetheeffectofcontentand portant to note that the pure text-based approach (which is network structure on accuracy and efficiency. We note that 0.15 0.75 0.35 text-only text-only 0.14 network-only network-only GE CLUSTER PURITY 0000 .0...0111.91123 skentocnh--sbkaestechd((00..55)) GE CLUSTER PURITY 0 0.06..675 GE CLUSTER PURITY 0 0.02..253 skentocnh--sbkaestecdh((00..55)) AVERA 00..0078 AVERA 0.55 netwtoerxkt--oonnllyy AVERA 0.15 0.06 skentocnh--sbkaestechd((00..55)) 0.1 0.05 0.5 100 200 300 400 500 600 700 800 900 1000 100 200 300 400 500 600 700 800 900 1000 20 40 60 80 100 120 140 160 180 200 NUMBER OF CLUSTERS NUMBER OF CLUSTERS PROGRESSION OF STREAM (a)PuritywithNumber (b)PuritywithNumber (c)PuritywithStream ofClusters(Twitter) ofClusters(Enron) Progression(Twitter) 0.12 0.12 0.84 text-only sketch-based(0.5) sketch-based(0.5) GE CLUSTER PURITY 00000 ....0.77778.246882 skentocnhn--esbtkwaesotecrkhd-((o00n..55ly)) GE CLUSTER PURITY 000 ..0. 0110.9011.51515 non-sketch(0.5) GE CLUSTER PURITY 000 ..0.0 110.9011.11555 non-sketch(0.5) AVERA 0 .06.87 AVERA 0 .00.8059 AVERA 0 .00.8095 0.66 0.08 0.08 50 100 150 200 250 300 100 200 300 400 500 100 200 300 400 500 PROGRESSION OF STREAM HASH TABLE SIZE (x 1000) HASH TABLE SIZE (x 1000) (d)PuritywithStream (e)HashTableLength (f)HashTableLength Progression(Enron) Sensitivity(Twitter) Sensitivity(Twitter) Figure2: EffectivenessResultsforClustering ourbaseline)performedtheworstinallscenarios. Thesere- tered. As a result, the cluster purity also generally reduces sultsalsoseemtosuggestthatthenetworkinformationinthe withstreamprogression. socialstreamprovidesmuchmorepowerfulinformationfor Finally, we also tested the sensitivity of the approach clustering as compared to the text-information. This is not with sketch-table length and width. The sensitivity re- particularly surprising for sources such as the Twitter data sultswithincreasingsketch-tablelengthfortheTwitterdata stream in which the text is rather noisy and often contains streamisillustratedinFigures2(e). Thesketchtablelength non-standard acronyms or other text which are hard to use isillustratedontheX-axis, whereastheclusterpurityisil- in a meaningful way. However, we found it somewhat in- lustrated on the Y-axis. The number of clusters was set to terestingandsurprisingthatthesetrendswerealsotruefora 500inthiscaseandsketchtablewidthwassetto2. Wehave sourcesuchastheEnrondatastream,inwhichthetextwas alsoshowntheresultsforthemethodwhichdoesnotusethe relativelycleanandwasusuallyquiteinformativefortheun- sketch in the same figure in order to provide a baseline for derlyingclasslabels. the relative effectiveness of the incorporation of the sketch Wehavealsoillustratedtheclusterpuritywithprogres- structure. Itisclearthatthecluster-purityincreaseswithin- sion of the stream. This provides a dynamic idea of how creasingsketch-tablelength. Thisisbecausealargersketch- the clustering process performed during the progression of table length reduces the number of collisions in the sketch thestreamitself. TheresultsfortheTwitterandEnrondata table, and it therefore improves the overall accuracy. We streams are illustrated in Figures 2(c) and (d) respectively. alsonotethatwhenthesketch-tablelengthisincreasedsuf- Thenumberofclusterswasfixedto750.FortheTwitterdata ficiently, the accuracy of the approach based on the sketch- stream the sketch table length was fixed to 262,213 and for tableapproaches thatofthemethodwhichdoesnotusethe Enronitwassetto16,369. Thesketchtablewidthwasfixed sketchtable. Insuchcases,thecollisionsreducesufficiently to2inbothcases. TheX-axisillustratestheprogressionof to the point that there of reduction in accuracy because of thestreamforevery1000objectsprocessed,andtheY-axis sketch=tableuse. illustratestheclusterpurityinthelasttimewindowof1hour. We also tested the sensitivity of the approach with Therelativetrendsbetweenthedifferentmethodsaresimilar increasing sketch-table width. The sensitivity results with to our earlier observations, though the main observation is increasing sketch-table width for the Twitter data stream is thattheclusterpuritiesgenerallyreducewithprogressionof illustratedinFigures2(f). Thenumberofclusterswassetto the stream. The reason for this is that the number of class 500 and sketch-table length was set to 262,213. While the labels in the stream generally increase over the progression effectiveness results improve with increasing sketch-table of the stream, as new classes, tags and events are encoun- width, it is evident that the purity results are not quite as sensitive to sketch-table width, as they were to the sketch- ratherthanwidth. table length. This is because an increase in the number of hash functions provides additional robustness, but it does 4.6 Unsupervised Event Detection Case Study In this not drastically reduce the number of collisions between the section,weprovidedacasestudyoftheunsupervisedevent differentitems. detectionproblem. Bothevolutionaryandnoveleventswere detectedwiththeuseofthisapproach.Typically,suchevents 4.5 EfficiencyResultsforClustering Wealsotestedthe wereassociatedwithaparticularnews,country,languageor efficiencyoftheclusteringapproachwithincreasingnumber mood. ofclusters. Theparametersettingswerethesameasthoseof The first event was related to the Japan nuclear crisis. theeffectivenessresults. TheefficiencyresultsfortheTwit- In particular, the relevant segment of the social stream terandEnrondatastreamsareillustratedinFigures3(a)and corresponds to the case where the Japanese Prime Minister (b) respectively. The X-axis denotes the number of clus- requested the Chubu electric company to shut down the ters, whereas the Y-axis denotes the number of stream ob- Hamaoka nuclear plant. This event generated considerable jects processed every hour. It is evident that the network- chatter in the social stream, in which the underlying actors based approach was much slower than the text-based ap- werediscussingthepositivesandnegativesofthisdirective. proach. Thisisbecauseofthelargenumberofdistinctnodes The corresponding portion of the social stream contained whichneedtobeprocessedbyatext-basedapproach. How- structural nodes which were geographically biased towards ever,thesketch-basedapproachissignificantlyfasterinboth Japan, and generally created clusters of their own on the data streams, because of the fact that the number of opera- basis of the underlying network and content information. tionsinthesimilaritycomputationarereducedbythesketch The frequent text content in these clusters included the representation. Another interesting observation is that the followingwords: processing rate did not necessarily reduce with increasing nuclear,hamaoka,plant,concerns,dilemma,gaswat,hairline,halt, number of clusters. Much of this was also because the a shut,heavy,japan,neglect,operation(1.152) larger number of clusters resulted in faster similarity com- One interesting aspect ofour algorithm was that itwas putationsbetweenincomingobjectsandtheunderlyingclus- abletodetectevents,forwhichthecontentwasinaforeign ters. Thisisbecausealargernumberofclustersresultedin language. The reason for this is that the event detection sparserclusterswithfewernumberofobjectsineachcluster. algorithm does not use any methods which are specific to We also tested the efficiency with stream progression English,andtheuseofnetworkstructureisblindtotheuse and present the results in Figure 3(c) and (d) respectively. ofaspecificlanguage. Forexample,theministerforfinance ThestreamprogressionisillustratedontheX-axis,andthe in Indonesia issued an order on May 9, 2011 to buy 7% of processingrateisillustratedontheY-axis. Itisevidentthat the shares in PT Newmont Nusa Tenggara company. This the processing rate reduces with stream progression for all triggered discussion threads in twitter which were captured the different methods. This is because the clusters contain because of their related content and network structure. The a larger number of objects with progression of the stream. frequent text-content which was associated in the cluster Thisincreasesthecomplexityofthesimilaritycomputations mostrelatedtothiseventwasthefollowing: because the number of attributes in each cluster (in terms keputusan, menkeu, beli, newmont, saham, wapres, didukung, chal- of the number of text words or nodes) increases as well. lenge,minskade,metro,jak,kimiad,menos(0.9964) However, this slow-down levels out after a certain point as Wenotethattheentiretextcontentinthiseventconsists thenumberofattributesineachclusterstabilizes. offoreignlanguagewords. We also tested the sensitivity of the approach with Oureventdetectionalgorithmwasalsocapableofcon- hash-table length and width. The sensitivity results with nectingrelatedevents. Thisisbecausetheactorsandevents hash-table length and width for the Twitter data streams areoftenoverlappinginsuchcases.Thisoverlapcouldresult are illustrated in Figures 3(e), and (f). It is evident from in the placement of closely related events in the same clus- the figures that the running time is not very sensitive to ter. AnexampleofthisweretheprotestsonMay9,2011on the hash table length, and most of the variations are really the Gay bill death penalty in Uganda. Many of these same random variations. On the other hand, the hash table width actors were also discussing issues related to the Kentucky affects the number of hash functions which need to be anti-gaymarriagelaw,eventhoughthetwoeventswerege- computed. Therefore, the running time generally reduces ographically quite disparate. As a result, these two related with increasing hash table width. These results seem to eventswerereportedasasinglecompositeevent,sincethey suggest that a greater hash table width is not particularly reflected issueswhich were discussed in relation to one an- useful because it does not affect the purity very much, but otherbytheparticipants. Thecorresponding contentwords it affects the efficiency substantially. Therefore additional wereasfollows: memory should be used for increasing hash table length protest,power,sign,identity,much,please,political,citizens,petition,

Description:
one cannot store all the data on disk for re-processing. The online event detection problem is closely in the stream often correspond to events in the stream such
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.