On Classification of Graph Streams Charu C. Aggarwal∗ Abstract is typically a small sub-graph of the web graph. The browsing pattern over all users can be interpreted as a In this paper, we will examine the problem of classification continuous stream of such graphs. (3) The intrusion of massive graph streams. The problem of classification trafficonacommunicationnetworkisastreamoflocal- has been widely studied in the database and data mining ized graphs on the massive IP-network. community. The graph domain poses significant challenges This paper will examine the most challenging case because of the structural nature of the data. The stream in which the data is available only in the form of a scenario is even more challenging, and has not been very stream of graphs (or edge streams with graph iden- wellstudiedintheliterature. Thisisbecausetheunderlying tifiers). Furthermore, we assume that the number of graphs can be scanned only once in the stream setting, and distinct edges is so large, that it is impossible to hold it is difficult to extract the relevant structural information summary information about the underlying structures fromthegraphs. Inmanypracticalapplications, thegraphs explicitly. For example, a graph with more than 107 may be defined over a very large set of nodes. The massive nodes may contains as many as 1013 edges. Clearly, it domain size of the underlying graph makes it difficult to may not be possible to store information about such a learn summary structural information for the classification large number of edges explicitly. Thus, the complexity problem,and particularly soin thecase ofdatastreams. In of the graphstream classification problem arises in two ordertoaddressthesechallenges,weproposedaprobabilistic separate ways: approach for constructing an “in-memory” summary of the (1) Since the graphs are available only in the form of a underlying structural data. This summary determines the stream,this restrainsthe kindsofalgorithmswhichcan discriminativepatternsintheunderlyinggraphwiththeuse be used to mine structural information for future anal- ofa2-dimensionalhashingscheme. Weprovideprobabilistic ysis. An additional complication is caused by the fact bounds on the quality of the patterns determined by the the the edges for a particular graph may arrive out of process, and experimentally demonstrate the quality on a order (i.e. not contiguously) in the data stream. Since numberof real data sets. Keywords: Graphs, Classification re-processingis notpossible inadatastream, suchout- of-orderedgescreatechallengesforalgorithmswhichex- 1 Introduction tract structural characteristics from the graphs. (2)Themassivesizeofthegraphcreatesachallengefor Theproblemofgraphclassificationarisesinthecontext effective extraction of information which is relevant to of a number of classical domains such as chemical and classification. For example, it is difficult to even store biological data, the web, and communication networks. summary information about the large number of dis- Recent methods for representing semi-structured data tinct edges in the data. In such cases, the determina- have also lead to increasing interest in this field. A tion of frequent discriminative subgraphs may be com- detailed discussion of graph mining and classification putationally and space-inefficient to the point of being algorithms may be found in [4]. impractical. Recently,theemergenceofinternetapplicationshas The currently availablealgorithmsfor classification lead to increasing interest in dynamic graph streaming [13,14,21]arenotdesignedfortheenormouschallenges applications[2,3,10]. Suchnetworkapplicationsarede- inherent in the scenarios discussed above. In this pa- fined on a massive underlying universe of nodes. Some per, we will design a discriminative subgraph mining examples are as follows: approach to graph classification. We will define dis- (1) The communication pattern of users in social net- criminativesubgraphsbothintermsofedgepatternco- workin a modest time window canbe decomposedinto occurrence and the class label distributions. The pres- a group of disconnected graphs, which are defined over ence of such subgraphs in test instances can be used amassive-domainofusernodes. (2)The browsingpat- in order to infer their class labels. Such discrimina- tern of a single user (constructed from a proxy log) tive subgraphs are difficult to determine with enumer- ation based algorithms because of the massive domain ∗IBMT.J.WatsonResearchCenter,[email protected] of the underlying data. Therefore, we will design an 2 Discriminative Subgraph Patterns efficientprobabilisticalgorithmfordeterminingthedis- In this section, we will present the model for mining criminativepatterns. Theprobabilisticalgorithmusesa classification patterns in graph streams. We will first hash-basedsummarizationapproachto capturethe dis- introduce some notations and definitions. We assume criminative behavior of the underlying massive graphs. that the graph is defined over the node set N. This We will show how this compressed representation can nodesetisassumedtobedrawnfromamassivedomain be leveraged for effective classification of test (graph) of identifiers. The individual graphs in the stream are instances. denoted by G1...Gn.... Each graph Gi is associated This paper is organized as follows. In the next with the class label Ci which is drawn from {1...m}. section,wewillformalizetheproblemofclassificationof It is assumed that the data is sparse. The sparsity graph streams with the use of discriminative patterns. property implies that even though the node and edge We will leverage this framework in order to design an domain may be very large, the number of edges in the effective algorithm for graph classification in section individual graphs may be relatively modest. This is 3. Section 4 presents the experimental results. The generally true over a wide variety of real applications conclusions and summary are presented in section 5. such as those discussed in the introduction section. We further note that in the stream case, the edges 1.1 Related Work The problem of graph mining of each graph Gi may not be neatly received at a is often studied in the context of static data such as given moment in time. Rather, the edges of different chemical compounds or molecular biology in which the graphs may appear out of order in the data stream. nodes are drawn from a modest base of possibilities. This means that the edges for a given graph may not Since the graphs may be drawn on a limited domain, it appearcontiguouslyinthestream. Thisisdefinitelythe significantly eases the ability to perform mining tasks case for many applications such as social networks and in such cases. Furthermore, it is assumed that the communication networks in which one cannot control data set is of modest size, and it can be stored on the ordering of messages and communications across disk or main memory for mining purposes. Many different edges. Our paper will therefore assume this recent graph mining algorithms for frequent pattern difficult caseinwhichthe edges donotappear inorder. mining [12, 16, 20], clustering [1, 17] and classification Forthepurposesofthispaper,weassumethattheedges [13, 14, 21] are designed in the context of this modest are received in the form < GraphId >< Edge >< assumption. ClassLabel >. Note that the class label is the same Theproblemofgraphclassificationisstudiedintwo across all edges for a particular graph identifier. For differentcontexts. Intheproblemofnode classification, notational convenience, we can assume that the class the goal is to determine labels for the nodes with the label is appended to the graph identifier, and therefore use of linkage information from other nodes [13, 14]. we can assume (without loss of generality or class However,inobject based graph classification, the goalis information) that the incoming stream is of the form toclassifyindividual graphsasobjects[21]. Wefocuson < GraphId >< Edge >. The value of the variable the latter case, for which there are currently no known <Edge> is defined by its two constituent nodes. algorithms in the stream scenario. Some recent work In this paper, we will design a rule-based approach proposes algorithms for clustering and dense pattern which associates discriminative subgraphs to specific mining in graph streams [2, 3]. The work in [3] class labels. Therefore, we need a way of quantifying uses count-min sketches in order to cluster massive the significance of a particular subset of edges P for graph streams, though such an approach does not the purposesofclassification. Ideally, wewouldlikethe easilyextendtoclassificationbecauseoftheinformation subgraph P to have significant statistical presence in loss associated with the sketch representation. The terms of the relative frequency of its constituent edges. work in [2] also uses min-hash summaries for dense At the same time, we would like P to be discriminative pattern mining, though the approach is focussed on in terms of the class distribution. determining node patterns, rather than structural edge- Thefirstcriterionretainssubgraphswithhighrela- based discriminative patterns, as would be required in tive frequencyofpresence, whereasthe secondcriterion a classification application. As we will see, the latter retains only subgraphs with high discriminative behav- is a much more challenging problem because of the ior. We will discuss effective techniques for determin- significantly largernumber of edges as comparedto the ing patterns which satisfy both of these characteristics. nodes. This necessitates the design of a 2-dimensional First, we will define the concept of a significant sub- hash structure which is a combination of a min-hash graph in the data. A significant subgraph P is defined and a standard hashing scheme. asasubgraph(setofedges),forwhichtheedgesarecor- related with one another in terms of absolute presence. (b) Class Discrimination Constraint: The dom- We also refer to this as edge coherence. This concept is inant class confidence DI(P) is at least θ. In other formally defined as follows: words, we have DI(P)≥θ. Definition 1. Let f∩(P) be the fraction of graphs in The above constraints are quite challenging because of G1...Gn inwhichalledgesofP arepresent. Letf∪(P) the size of the search space which needs to be explored bethefractionofgraphsinwhichatleastoneormore inordertodeterminerelevantpatterns. Wehavechosen of the edges of P are present. Then, the edge coherence this approach, because it is well suited to massive C(P) of the subgraph P is denoted by f∩(P)/f∪(P). graphs in which it is important to prune out as many We note that the above definition of edge coherence irrelevant patterns as possible. The edge coherence is focussed on relative presence of subgraph patterns constraintisdesignedtopruneoutmanypatternswhich rather than the absolute presence. This ensures that are abundantly present, but are not very significant only significant patterns are found. Therefore, large fromarelativeperspective. Thishelpsinmoreeffective numbers of irrelevant patterns with high frequency but classification. lowsignificancearenotconsidered. Whilethecoherence definition is more effective, it is computationally quite 3 A Probabilistic Approach for Finding challengingbecauseofthesizeofthesearchspacewhich Discriminative Subgraphs may need to be explored. The randomized scheme Inthissection,wewilldescribeaprobabilisticmin-hash discussed in this paper is specifically designed in order approachfordeterminingdiscriminativesubgraphs. The to handle this challenge. min-hash technique is an elegant probabilistic method, Next, we define the class discrimination power of which has been used for the problem of finding inter- the differentsubgraphs. Forthis purpose, wedefine the esting 2-itemsets [8]. This technique cannot be easily class confidence of the edge set P with respect to the adapted to the graph classification problem, because of class label r ∈{1...m}. the large number of distinct edges in the graph. There- fore, wewill use a2-dimensionalcompressiontechnique Definition 2. Among all graphs containing subgraph in which a min-hash function will be used in combina- P, let s(P,r) be the fraction belonging to class label r. tion with a more straightforward randomized hashing We denote this fraction as the confidence of the pattern technique. We will see that this combination approach P with respect to the class r. is extremely effective for graph stream classification. Correspondingly,wedefinetheconceptofthedominant The aim of constructing this synopsis is to be class confidence for a particular subgraph. able to designa continuously updatable data structure, which can determine the most relevant discriminative Definition 3. The dominant class confidence DI(P) subgraphs for classification. Since the size of the or subgraph P is defined as the maximum class confi- synopsisis small, itcanbe maintainedinmainmemory dence across all the different classes {1...m}. and be used at any point during the arrival of the data stream. The ability to continuously update an AsignificantlylargevalueofDI(P)foraparticulartest in-memory data structure is a natural and efficient instanceindicatesthatthe patternP isveryrelevantto design for the stream scenario. At the same time, classification, and the corresponding dominant class la- thisstructural synopsis maintainssufficientinformation bel maybe anattractivecandidateforthe testinstance which is necessary for classification purposes. label. For ease in further discussion, we will utilize a tab- In general we would like to determine patterns ularbinaryrepresentationofthe graphdatasetwith N whichareinterestingintermsofabsolutepresence,and rows and n columns. We note that this table is only arealsodiscriminativefora particularclass. Therefore, conceptually used for description purposes, but it is not we use the parameter-pair (α,θ) which correspond to explicitly maintained in the algorithms or synopsis con- threshold parameters on the edge coherence and class struction process. The N rows correspond to the N interest ratio. different graphs present in the data. While columns Definition 4. A subgraph P is said to be be (α,θ)- represent the different edges in the data, this is not a significant, if it satisfies the following two edge- one-to-onemapping. Thisisbecausethenumberofpos- coherence and class discrimination constraints: sible edges is so large that it is necessary to use a uni- (a) Edge Coherence Constraint: The edge- form random hash function in order to map the edges coherence C(P) of subgraph P is at least α. In other onto the n columns. The choice of n depends upon words, we have C(P)≥α. the space available to hold our synopsis effectively, and affects the quality of the final results obtained. Since column in P(cid:4) with a 1-value are the same is equal to the many edges are mapped onto the same column by the edge coherence C(P). hash function, the support counts of subgraphpatterns The above observation is easy to verify, since the index are over-estimated with this approach. Since the edge- of the first row with a 1 in any of the columns in P(cid:4) is coherenceC(P)isrepresentedasaratioofsupports,the also the first row in which any of the edges of P(cid:4) are edge coherence may either be over-or under-estimated. present. The probability that this row contains all 1- Inordertosimplifyourdiscussion,wewillfirstdescribe values across the relevant columns is simply the edge the algorithm without the use of a hash-mapping of coherence probability C(P). We can use this approach the columns. Then, we will discuss the changes needed in order to efficiently construct a sampling estimate of to implement a column-wise hash mapping. Therefore, theedgecoherenceprobability. Wesimplyneedtousek thesetwocasesaresequentiallydiscussedinorderofin- differenthashvaluesforconstructingthesortorder,and creasing complexity: estimatetheabove-mentionedprobabilitybycomputing • Complete Table with no column-wise hash- the fraction of the k samples for which all values are 1. mapping: Inthiscase,wehaveacolumncorresponding First,wewilldiscusstheimplementationofthemin- to each distinct edge. Then, we apply min-hash com- hashapproachwith the use of a complete virtual table, pression on the rows only. While this is not a practical and no compression on the columns. In this case, we solution for real applications, it is useful to discuss it assume that the table has E columns, where E = N · first in order to set the stage for the discussion of the (N−1)/2isthenumberofpossibledistinctedgesinthe 2-dimensional compression scheme. data. The min-hash index is used in order to construct •CompressedTablewithcolumn-wisehashmap- a transactional representation of the underlying data. ping: Inthiscase,we mapmultiple columnsintoasin- First, we apply the min-has approach with k different glecolumnwiththeuseofaconventionalhashfunction. hash values in order to create a table of size k × E. Then, we apply the min-hash compressionon the rows. Now, let us examine a particular sample fromthis min- This is the main scheme designed by our approach. hash representation. Let r1...rE be the row indices The use of these two kinds of virtual tables is of the first row with a 1 in the corresponding column. helpful for analysis of the scheme with increasingly We partition the set r1...rE into groups for which the complex assumptions. Specifically, we will analyze min-hash index values are the same. Thus, if r1...rE our scheme with or without hash-compression on the are partitioned into Q1...Qh, then each ri ∈ Qj has columns. First, we will propose the simple scheme the same value (for fixed j). For each partition Qj, we without hash compression on the columns (and only create a categorical transaction containing the indices a min-hashing scheme on the rows), and analyze it of the corresponding columns. For example, if Qj = in the context of a complete tabular representation {r2,r23,r45,r71}, then we create the transaction Tj = of the edge representation. Then, we will analyze {2,23,45,71}. We can construct transactions T1...Th the additional effects of the 2-dimensional approach in corresponding to the equi-index partitions Q1...Qh. whichwealsoapplyauniformrandomhashcompression Sinceeachindexsetcreatesacorrespondingtransaction on the columns. We will show that the overall effect of set, we can improve the accuracy of our estimation this approach is to create a highly effective classifier process by repeating this sampling process k times and which can work with fast graph streams. creating k different index sets. For each index set, we add corresponding transaction sets to the overall 3.1 Min-hashApproach withCompleteVirtual transaction set T. We make the following observation: Table The main idea of the min-hashing scheme is to determine co-occurrence in the values on different Observation 2. Let T be the transactions constructed columnsbyusingthesort-orderontherowsinthedata. from the min-hash index set. Then, the coherence In order to create this sort-order, we generate a single probabilityC(P)ofanedgesetP canbeestimatedasthe uniformly random hash value for each row in the data. absolute support of P(cid:4) in the transaction set T, divided All the columns in the data are sorted by order of this by k. hash value. Note that such an approach results in each Observation 2 is a direct extension of Observation 1. column being sortedin exactly the same randomorder. ThisisbecausethecolumnsetP(cid:4)supportsatransaction We observe the following; in T1...Th if and only if the min-hash indices of the Observation 1. Consider a set P of edges. Let P(cid:4) be edgesinP(cid:4) arethesameinthe correspondingmin-hash the columns in the virtual table corresponding to P. We sample from which that transaction was constructed. examine these columns in sorted order by hash value. It remains to explain the details of the implemen- The probability that the indices of the first row for each tation in the graph stream scenario. We note that the AlgorithmUpdateMinHash(Incoming Edge: e∈G, V and I for each distinct edge encountered so far in SamplingRate: k,Edge-specificMin-hashvalues: V the data stream. For each edge e, the ith minimum Edge-specificMin-hashindices: I); {AssumethatgraphGisassociatedwithanumerical hash samples in V represents the minimum value of identifierId(G)whichdenotes theorderinwhichit h(i,Id(G))overallgraphs Gwhichcontaine. Theken- wasreceivedindatastream;} tries in I for a particular edge e represent the k graph begin identifiers for which these minimum hash values in V Generatek randomhashvaluesp1...pk, were obtained. wherepi=h(i,Id(G)); foreachpi inthek generated hashvaluesdo As mentioned earlier, the edges in the stream may begin appear out of order, and therefore each edge is asso- ifedgeeisnotincludedincurrentsetof ciated with its corresponding graph identifier in the min-hashstatistics (I,V)then stream. For each incoming edge e and graph iden- Add(e,pi)toV and(e,Id(G))toI; tifier Id(G), the sets V and I are updated as fol- else begin Let(e,m)∈V and(e,CurrentId)∈I bethe lows. We generate the k different hash values p1 = currenttuplesinV andI associatedwith h(1,Id(G))...pk = h(k,Id(G)). In the event that the theithmin-hashvaluefore; edge e has not been encountered so far, its information ifpi<mreplace(e,m)and(e,CurrentId)with willnotbepresentinV andI. Insuchacase,thevalue (e,pi)and(e,Id(G))respectively; of L is incremented, and the corresponding entries for end e are included in V and I respectively. Specifically, the end end k hash values in V for this newly tracked edge are set tothekrecentlygeneratedhashvaluesp1...pk. There- fore,weincludetheentries(e,p1)...(e,pk)inV. Corre- Figure 1: Updating Min-hash Index for each Incoming spondingly, we add the entries (e,Id(G))...(e,Id(G))) Edge to V. On the other hand, if the edge e has al- ready been encountered in the stream, then we exam- ine the corresponding entries (e,MinHashValue1)... rows of our (virtual) table representing edge presence (e,MinHashValuek) inV. Inthe eventthatthe newly in a particular graph may not be available at a given generated hash value pi is less than MinHashValuei, time. This is because the edges for a given graph may thenwereplace(e,MinHashValuei)by(e,pi). Wealso not arrive contiguously. Furthermore, the data is typ- replacethe correspondingentryinI by(e,Id(G)). The ically very sparse and most values in the table are 0. overall approach for maintaining the hash indices for Therefore, we need an update process which can work each incoming edge is illustrated in Figure 1. withsparseunorderededgesanddoesnotusethevirtual We note that the process discussed in Figure 1 table explicitly. continuously maintains a summary representation of For each incoming graph edge e ∈ G in the stream the underlying graphs. Since these summary statistics (with graph identifier Id(G)), we generate k random are maintained in main memory, we can utilize them hash values h(1,Id(G)) ...h(k,Id(G)). The ith hash at any time in order to determine the coherent edge value h(i,Id(G)) is denoted by pi. We note that the patterns. This is because an analyst may require the hashfunction h(·,·) uses Id(G) as aninput, so thatthe classificationapproachtobeappliedatanytimeduring samerandomhashvaluecanbegeneratedwhenanedge the data stream computation. To do so, we construct fromthesamegraphGisencounteredatalaterstagein transactionsetsassociatedwiththestatistics(V,I). As thestream. Inordertogeneratetherandomhashvalue, discussedearlier,eachtransactioncontainsasetofedges we use a standard hash function on the concatenation which take on the same min-hash identifier in I (for a of the strings corresponding to the two arguments of particular hash sample from the k possible samples). h(·,·). This creates the transaction set T. Then, the problem Let L be the running estimate of the number of of determining edge sets with coherence at least α is as distinct edges encountered so far in the stream. We follows: maintain a set V of k·L running minimum hash values togetherwith a setI ofk·L correspondinggraphiden- Problem 1. Determine all frequent patterns in T with tifiers for which these minimum values were obtained. support at least α·k. The value of L will increase during progression of the data stream, as more and more distinct edges are re- This part of the procedure is applied only when an ceived by the stream. Each entry in V is of the form analyst is required to classify a set of examples, and (e,MinHashValue), andeachentry inI is ofthe form canbeperformedofflinebyusingthecurrentlyavailable (e,GraphIdentifier). There are k such entries in both statistics(V,I). Wenotethatsincethisisthestandard version of the frequent pattern mining problem, we AlgorithmUpdateCompressedMinHash(Incom. Edge: e∈G, SamplingRate: k,Min-hashvalues: V can use any of the well known algorithms for frequent Min-hashindices: I,ColumnCompressionSize: n); pattern mining. Furthermore, since the transaction {AssumethatgraphGisassociatedwithanumerical data is typically small enough to be main-memory identifierId(G)whichdenotes theorderinwhichit resident,wecanusespecializedfrequentpatternmining wasreceivedindatastream;} algorithmswhichareveryefficientfor the caseofmain- begin Generatek randomhashvaluesh(1,Id(G))...h(k,Id(G)); memory resident data. Mapeontoanintegerie in[1,n]withtheuseof It remains to analyze the accuracy of this random- auniformrandomhashfunction; ized approach. Because of the massive nature of the Applysamemin-hashapproachofofFigure1tothis underlying data sets, we may generate a fairly large edgeexceptthatweapplytoie insteadofe; number of patterns. We would like to minimize the end spuriously generated patterns in order to increase the effectiveness of our approach. Therefore, we increase Figure 2: Updating Min-hash Index for each Incoming the coherence threshold, so that we generate no false Graph with Column Compression positives with high probability. Specifically, we set the coherence threshold to α·(1+γ)·k, where γ ∈(0,1). Lemma 3.1. The probability of a pattern P determined results in a simultaneous row and column compression, from T to be a false positive (based on coherence prob- thoughthecolumncompressionisperformeddifferently. ability), when using a coherence threshold of α·(1+γ) The overall approach is similar to that of Figure 1, and k samples is given by at most e−α·k·γ2/3, where e is except that each edge in the data is first mapped to an the base of the natural logarithm. integer in the range [1,n] with the use of uniform ran- dom hash function, and then the steps of Figure 1 are Proof. LetYi be a0-1variablewhichtakesonthevalue applied. Thus, in this case, instead of determining pat- of 1, if the pattern P is included in the ith min-hash terns on the actual edges, we are determining patterns sample. Thus, Yi is a bernoulli random variable with on integers in the range [1,n]. The overall approach is probability less (cid:2)than α. The support of pattern P can illustrated in Figure 2. Note that this algorithm differs k be expressedas i=1Yi. By using the Chernoffbound, fromFigure 1 inthe sense thatit uses the columncom- we obtain the desired result. pressionsizenasaninput,andusesittomapeachedge ontoanintegerin[1,n]. This mapping is done withthe The above resultcanbe restatedto determine the min- useofauniformrandomhashfunction. We constructa imumnumberofsamplesrequiredinordertoguarantee string representation of the edge by concatenating the the bound. nodelabelsofthetwoends. Weapplythehashfunction Corollary 3.1. The number of samples k required in to this string representation. Other than this, most of the steps in Figure 2 are the same as those in Figure 1. order to guarantee a probability at most δ for any of The value of n is picked based in storage consider- the determined patterns to be a false positive is given by 3·ln(1/δ)/(α·γ2). ations, and is typically much lower than the number of distinct edges. The choice of n results from a natural Next, we will study the further enhancement of our tradeoffbetweenstoragesizeandaccuracy. Sincemulti- approach with a compressed virtual table. pleedgesmaymapontothesameinteger,thisapproach may result in some further degradationof the accuracy 3.2 Min Hash Approach with Compressed Vir- ofcomputation. Therefore,we will examinethe level to tual Table We note thatthe main constraintwith the which the edge coherence computation is additionally useoftheaboveapproachisthatthesummarystatistics degraded purely because of column compression. Thus, V andI havesizeO(k·L),whereListhenumberofdis- thetotaldegradationwouldbetheproductofthedegra- tinctedgesencounteredsofar. Theproblemhereisthat dations because of row and column compression. the value of L may be very large because of the mas- A key property of real data sets is that they are sive graph assumptions of this paper. Such large sum- oftensparse. Thispropertycanbeleveragedinorderto marizations cannot easily be maintained even on disk, guarantee the effectiveness of column compression. Let andcertainlynotinmainmemory,whichisrequiredfor us assume that the average number of edges contained an efficient update process in the data stream scenario. in the different graphs in the data stream is given Therefore, we will supplement our min-hash row com- by a. The sparsity assumption implies that a is pressionwithamoredirectcolumncompressioninorder typically extremely small compared to the number of tofurtherreducethe sizeofthe requiredsynopsis. This distinct edges and is also quite small compared to the compressedcolumncardinalityn. Therefore,weassume spuriously map edges onto P −Q. We compute only that a << n. We note that the coherence probability the probability that a pattern which is one edge away of patternP is expressedas the ratio f∩(P)/f∪(P). As from P gets counted as P, because in the case of a<< longasbothf∩(P)andf∪(P)areaccuratelyestimated, n, the probability of multiple collisions is extremely the coherence probability is accurately estimated as small. There are |P| such patterns, and by using a well. We note that both f∪(P) and f∩(P) are over- similar analysis to the previous proof, we can show estimated because all correct patterns are counted, thatthe probabilityof eachsuchpatternbeing counted whereas the hash result for some spurious edges may spuriously is given by a/n. Therefore, the overall collide with some of the columns corresponding to P. probability of spurious over-counting is given by a·|P| n Letf∩(cid:4)(P) andf∪(cid:4)(P)be the valuesoff∩(P)andf∪(P) as in the previous case. whenexhaustivelycountedoverthecolumncompressed table. Let |P| denote the number of edges in the The aboveresults show that both f∪(P) and f∩(P) are subgraph P. We make the following observations: very closely approximated by this approach. This is especially the case since useful values of |P| are quite Lemma 3.2. Let f(cid:4)(P) be the estimated support of ∪ small and typically less than 4 or 5. For example, con- P on the column-compressed data with the use of a sider a data set with a=10, n=10,000,and |P|=10. uniform hash functions. Then, the expected value of f(cid:4) (P) satisfies the following relationship: In such cases, both f∩(P) and f∪(P) are both approxi- cup mated to within 0.01of their true values.Therefore,the column compression scheme reduces the space require- a·|P| (3.1) f∪(P)≤E[f∪(cid:4)(P)]≤f∪(P)+ ments significantly at a relatively small degradation in n quality. Proof. We know that f∪(P) ≤ E[f∪(cid:4)(P)], since all true patterns are counted, but some spurious patterns are 3.3 Determining Discriminative Patterns In also counted because of collisions. this section, we will discuss how the discriminative pat- In order to compute the fraction of spurious pat- ternsaredeterminedforclassification. Inordertodeter- terns, we need to compute the probability that a trans- mine the discriminative patterns, we also need to keep actioncontainsspuriousedgesbecauseofcollisions. Let trackoftheclasslabelsduringthemin-hashingscheme. mi be the number of edges in the ith graph. The A mentioned earlier, we assume that the class labels of probability that any of the mi edges get mapped to thegraphsareappendedtotheidentifierId(G)foreach one of the |P| edges because of a collision is given by graphs G in Figure 1. These class labels can be used in mi·|P|/n. Therefore,bysummingthis probabilityover order to compute the discriminative patterns from the all graphs, we obtain the expected over-estimation of compressed summary of the data. We note that the (cid:2) the su(cid:2)pport. This is equalto |Pn|· Ni=1mi/N. We note global distribution of class labels in the min-hash sum- N mary may not be the same as the original data stream, that i=1mi/N is simply equal to a. By making the becauseofitsinherentbiasinrepresentinggraphidenti- correspondingsubstitution in the expression, we obtain fierswithlargernumberofedgesinthe summarytrans- the desired result. actionsetT. Therefore, wedirectly maintainthe global We can obtain a similar result for f(cid:4)(P). class distribution while scanning the data stream, since ∩ theglobalstatisticsarerequiredinordertocomputethe Lemma 3.3. Let f(cid:4)(P) be the estimated support of dominantinterestratioforanyparticularpattern. How- ∩ P on the column-compressed data with the use of a ever, for a particular pattern containing a fixed number uniform hash functions. Then, the expected value of of edges, we can make the following observation: f(cid:4) (P) approximately satisfies the following relation- cap Observation 3. The class fraction for any particular ship: pattern P and class computed over the transaction set a·|P| T is an unbiased estimate of its true value. (3.2) f∩(P)≤E[f∩(cid:4)(P)]≤f∩(P)+ n Theaboveobservationistruebecausetheclassdistribu- Proof. The lower bound is trivially true since all true tionisnotusedduringeitherthecolumncompressionor patterns are always counted. the row based min-hashing. Furthermore, we are com- For the case of the upper bound, we need to paring the class distribution among graphs containing estimate the probability that a graphcontains a proper onlyafixednumberofedges. Therefore,theprobability subset Q of the pattern P (Q ⊂ P), but it gets ofaparticulargraphbeingpickedasthefirstgraphcon- counted as a true pattern because of collisions which taining that pattern in the min-hash scheme is exactly the same. Therefore, while the overall distribution of mining algorithms on T in order to determine the co- classeswillnotbe thesameinthe globalmin-hashsum- herentpatternsandthendeterminingthediscriminative mary,the distributionofclassesfor a particular pattern patterns among these. In order to perform the classifi- willbethesame. Asinthecaseofcoherentpatterns,we cation, we determine all the α-coherentand θ-confident willre-defineourthresholdsinsuchawaysoastoallow patterns in the data. These patterns are sorted in or- no false positives. We note that the constraint on the der of the dominant class confidence. For a given test coherenceprobabilityensuresthatatleastk·αpatterns graph, we determine the first r patterns which occur as of each type are available in the min-hash index. This asubgraphofthe testgraph. The majorityvoteamong ensures that sufficient number of samples are available these patterns is reported as the relevant class label. tomeasuretheclass-discriminationbehavioraswell. As inthecaseofthecoherencethreshold,wecansetadom- 4 Experimental Results inant confidence threshold which is a factor (1+γ) of In this section we will present the experimental tech- its true value. niques for the effectiveness and efficiency of our clas- Lemma 3.4. The probability of a pattern P determined sification technique. Specifically, we will test over the fromT tobeafalsepositive(basedonclass-confidence), following measures. We tested our technique for effec- when usingadominant confidencethreshold of θ·(1+γ) tiveness, efficiency, and sensitivity to a variety of pa- rameters. We used the following data sets to test our and k samples for the min-hash approach is given by at most e−α·θ·k·γ2/3, where e is the base of the natural algorithms: DBLP Data Set1: Inthiscase,weconstructedgraphs logarithm. onthe data set in whichauthors were defined as nodes, Proof. Since each of the relevant patterns is coherent, and co-authorship in a particular paper was defined as wecanassumethatatleastα·ksamplesofthatpattern an edge. Each paper constituted a graph. The class areavailableinthe min-hashindex inorderto estimate labels constituted the following three possibilities: (1) the class distribution of a particular pattern. In order Database Related Conferences (2) Data Mining for a pattern to be a false-positive, its class presence Related Conferences, and (3) All remaining con- must be less than θ in the original data, but must have ferences. a presence greater than θ · (1 + γ) in the min-hash We note that there was considerable overlap in co- table. Therefore, the Chernoff bound implies that the authorship structure across different classes, especially estimation of the class fraction within a factor (1+γ) the first two. This made the classification problem of the true value with the use of at least α·k samples particularly challenging. The final data set contained has probability at least e−α·θ·k·γ2/3, where e is the base over 5∗105 authors and about 9.75∗105 edges (over of the natural logarithm. all graphs) for the training data. This corresponds to about 3.55∗105 different graphs. Thus, even though We can immediate invert the result in order to deter- the data set was large, the graphs were fairly sparse mine the minimum number of samples required for a with a very small number of edges in each. fixed probability δ. Sensor Streaming Data Set: This data con- Corollary 3.2. The number of samples k required in tained information about local traffic on a sensor order to guarantee a probability at most δ for any of network which issued a set of intrusion attack types. the determined patterns to be a false positive (based on Each graph constituted a local pattern of traffic in dominant class confidence) is given by 3·ln(1/δ)/(α·θ· the sensor network. The nodes correspond to the IP- γ2). addresses, and the edges correspond to local patterns Note that Corollary 3.1 differs from Corollary 3.2 of traffic. We note that each intrusion typically caused in the sense that the latter has an additional factor a characteristic local pattern of traffic, in which there θ < 1 in the denominator. Therefore, the sample size were some variations, but also considerable correla- k from Corollary3.2 will dominate the requiredsample tions. Each graph was associated with a particular size from Corollary 3.1. intrusion-type. Since there were over 300 different The min-hash structure is continuously maintained intrusion types, this made the problem particularly over the progress of the data stream. Since this struc- difficult in terms of accurate classification. Our data tureisfairlycompact,itcanbeusedatanytimeduring set Igraph0103-07 contained a stream of intrusion the progress of the data stream in order to classify a graphsfromJune1, 2007to June3, 2007. The training given set of test instances. This classification can eas- ily be done offline, by using standard frequent pattern 1DataSetURL:www.charuaggarwal.net/dblpcl/ 97 96.5 96 97 CLASSIFICATION ACCURACY (%)99992345999....3455555 2B−ADS EHLAINSHE (CDOISMKP BRAESSESDE DN NS−TCRLEAASMS CIFLIEARSS)IFIER CLASSIFICATION ACCURACY (%)99993456999....4565555 2B−ADS EHLAINSHE (CDOISMKP BRAESSESDE DN NS−TCRLEAASMS CIFLIEARSS)IFIER 920 20 40 60 80 100 120 140 160 180 200 93 ROW SAMPLE SIZE (k) 92.5 Figure3: Effectivenesswithdifferentmin-hashrowsizes 922000 4000 6000 8000 10000 12000 14000 16000 COLUMN SAMPLE SIZE (n) k for data set DBLP ; n=10000,α=0.05, θ =0.4 Figure5: Effectivenesswith differentcolumnhashsizes 32 n for data set DBLP ; k =50, α=0.05, θ =0.4 30 CLASSIFICATION ACCURACY (%)22222468 2B−ADS EHLAINSHE (CDOISMKP BRAESSESDE DN NS−TCRLEAASMS CIFLIEARSS)IFIER 3346 32 20 kFigfourred4a:taEff1s80eetct2I0ivgrea4n0pehs6s001w0Ri8O30tWh- SA0M1dP07L0Ei SffIZ;E1 e2(k0)rnen14=0tm1160i0n01-800h0a,2s00hαro=w0si.z0e5s, CLASSIFICATION ACCURACY (%)22234680 2−D HASH COMPRESSED STREAM CLASSIFIER 22 BASELINE (DISK BASED NN−CLASSIFIER) θ =0.4 20 128000 4000 6000 8000 10000 12000 14000 16000 COLUMN SAMPLE SIZE (n) data was constructed by uniformly sampling about 90% of the size of a larger base data set. The test data Figure6: Effectivenesswith differentcolumnhashsizes contained the remaining 10% of the size of the base nfordatasetIgraph0103-07 ; k=50,α=0.05,θ =0.4 data. The training data stream contained more than 1.57∗106 edges in the aggregate. 4.1 Effectiveness Results Since there are no knowncompetinggraphalgorithmsforthestreamcase, 97 wedesignedabaselinewhichworksforthediskresident 96.5 case. While such a technique will not scale over mas- 96 sodsgicurevaasrenipgtsdhneascet,thhdaanenasiqdtdrudaeteithasakmeo-nvfbsrea,corsoimtemcdoispmdnuuipestskaeeer,stfeuitnrslhetgt-eoonmcretlgieoegatsshnhtebiostoztdhersgse.rcetalaTffhpseehhsceitetfiriodevegfretoehnrsweeesh,itsniewctoshoetf CLASSIFICATION ACCURACY (%)99934599...45555 2B−ADS EHLAINSHE (CDOISMKP BRAESSESDE DN NS−TCRLEAASMS CIFLIEARSS)IFIER instance. We note that the approach needs two scans; 93 one for reorganizingthe edges into graphs, and another 92.5 for determining the closest graph to the current target. 920 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 COHERENCE THRESH (alpha) Sincetheapproachrequirestwoscans,itcanbeusedfor disk-resident data, but not for data streams. Further- Figure 7: Effectiveness with different coherence thresh- more, the efficiency of the nearest neighbor approach olds α for data set DBLP ; k =50, n=10000, θ =0.4 reduces with increasing data set size; therefore, it can- not practically be used for very large data sets in an 32 neighbor technique with the use of a horizontal line. We note that the parameter on the X-axis corresponds 30 to a parameter of the min-hash scheme (only), and CLASSIFICATION ACCURACY (%)22222468 2B−ADS EHLAINSHE (CDOISMKP BRAESSESDE DN NS−TCRLEAASMS CIFLIEARSS)IFIER tnahhlaolestrcheav-fbasoerarsyse,edatthhccreeloacsasslcsaicsdfiusieirfffiraeccwrayeatnisootfmnvtuaahclcuehceubhsraiagoschfeyelitrnhoetfehttaphenaecrhtachnomeimqebutpaeersree.dslsoineIendes nearestneighbortechnique. Thiswasinspiteofthefact 20 that the hash-based classifier can be used for streams 180 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 of arbitrary length, whereas the baseline technique COHERENCE THRESH (alpha) requires two passes, and can only be used with a disk Figure 8: Effectiveness with different coherence thresh- resident database. Furthermore, the technique is not olds α for data set Igraph0103-07 ; k = 50, n = 10000, scalable to very large databases, because of its need θ =0.4 to compute an increasing number of distance values with progression of the stream. In each of these cases, the classificationaccuracy was not very sensitive to the variationin the min-hash rowsize. The data set DBLP efficient way. Nevertheless, the approach is useful as a wastheleastsensitive,whereasthedatasetIgraph0103- baseline for testing the quality of the approach. 07 was somewhat more sensitive. This is because the We tested the variation in the quality of our ap- sizes of the individual graphs in Igraph0103-07 were proach by changing various input parameters. Two larger. Therefore, the min-hash technique was able to kinds of parameters were varied: (1) Compression take better advantage of the structural information in parameters which determine the size of the data the graphs with increased row sizes. In spite of the structure: These include the row compression size k differences in accuracies across the various data sets, it and column compression size n. Clearly, the use of is clear that the min-hash technique was significantly larger row and column sizes can increase the effective- superior to the baseline method in all cases. We note nessoftheapproachattheexpenseofgreaterspace. We that the absolute classification accuracies should be willexaminethistradeoffinsomedetail. (2)Variation understood in context of the fact that these data sets in coherence threshold α and confidence thresh- contained over 300 different classes. Therefore, this is old θ: We will test the sensitivity of the approachwith an extreme case of classification into a large number of the parameters α and θ. rare classes, and the absolute classification accuracies Unless otherwise mentioned, the default values of arequitehighwhenwetakethisfactintoconsideration. the parameters were k = 50, n = 10000, α = 0.05 InFigures5,and6wehaveillustratedthevariations and θ = 0.4. Furthermore, since our goal is to test intheeffectivenessofthe schemewithdifferentlevelsof the experimental effectiveness for different thresholds column compression for the two data sets. The X-axis values of α and θ, we did not reset them with the use illustratestheincreasingcolumnsamplesizen, whereas of the γ parameter. This means that γ was effectively the Y-axis illustrates the classification accuracy. As in set to 0, and this allowed a greater number of false the previous case, we have illustrated the (constant) positivesoversmall valuesofk. We will showthateven baseline accuracy with a horizonal time, since the inthisscenario,ourtechniqueprovidesveryhighquality parameters on the X-axis relate to the min-hashing results in terms of final classification. This shows that schemeonly. Itisevidentthatthevariationinaccuracy ourtechniqueprovidesgreatexperimentalrobustnessin depends upon the compression scheme used. As in the addition to its theoretical properties. We further note previouscase,therewasverylittlevariationinaccuracy that at the default value of the parameters, the space across different column sizes for the case of the DBLP occupied by the min-hash structure is of the order of data set. The variations are much more significant for only1MB.Thiscantypicallybecapturedinthememory the Igraph0103-07 data set. We also note that the limitations of even very space-constrained devices. In variations are more significant for the case of column Figure 3 and 4, we have illustrated the variation in compression as compared to row compression. This effectiveness of the scheme with increasing min-hash is partially because the column-wise hashing scheme row size for the two data sets. The X-axis contains can sometimes map edges with correlations to different the min-hash row size k, while the Y-axis contains the classestothesamecolumn. Such“mixing”canresultin classification accuracy. The other parameters n, α and some reduction of the classification accuracy. However, θ were set to their default values discussed above. We in each case, the 2-d hash compressed scheme was havealsoillustratedtheaccuracyofthebaselinenearest
Description: