Multi-criteria Anomaly Detection using Pareto Depth Analysis

Ko-Jen Hsiao∗, Kevin S. Xu†, and Alfred O. Hero III‡
EECS Department, University of Michigan, Ann Arbor, MI 48109-2122, USA

October 18, 2011

∗Email: [email protected]
†Email: [email protected]
‡Email: [email protected]

Abstract

We consider the problem of identifying patterns in a data set that exhibit anomalous behavior, often referred to as anomaly detection. In most anomaly detection algorithms, the dissimilarity between data samples is calculated by a single criterion. When dissimilarities are calculated by multiple criteria, one might perform anomaly detection using a linear combination of the multiple dissimilarities. If the importance of the different criteria is not known in advance, the algorithm may need to be executed multiple times with different choices of weights in the linear combination, perhaps selected by grid search.

In this paper, we introduce a novel non-parametric multi-criteria anomaly detection method using Pareto depth analysis (PDA). PDA uses the concept of Pareto optimality to detect anomalies under multiple criteria without having to run an algorithm multiple times with different choices of weights. The proposed PDA approach scales linearly in the number of criteria, unlike grid search methods, which scale exponentially. We find that PDA is able to outperform any linear combination of criteria, including weights selected by grid search.

1 Introduction

Anomaly detection is an important problem that has been studied in a variety of areas and used in diverse applications including intrusion detection, fraud detection, and image processing (Hodge and Austin, 2004; Chandola et al., 2009). Many methods for anomaly detection have been developed using both parametric and non-parametric approaches. Non-parametric approaches typically involve the calculation of dissimilarities between data samples. For complex high-dimensional data, several dissimilarity measures corresponding to different criteria may be required to detect different types of anomalies. For example, consider the problem of detecting anomalous object trajectories in video sequences. Multiple criteria, such as dissimilarity in object speeds or trajectory shapes, can be used to detect a greater range of anomalies than a single criterion. In order to perform anomaly detection using these multiple criteria, one could first combine the dissimilarities using a linear combination. However, in many applications, the importance of the different criteria is not known in advance. It is difficult to determine how much weight to assign to each dissimilarity measure, so one may have to choose multiple weightings using, for example, a grid search. Furthermore, when the weights are changed, the anomaly detection algorithm needs to be re-executed using the new weights.

In this paper we propose a novel non-parametric multi-criteria anomaly detection approach using Pareto depth analysis (PDA). PDA uses the concept of Pareto optimality to detect anomalies without having to choose weights for different criteria. Pareto optimality is the typical method for defining optimality when there may be multiple conflicting criteria for comparing items. An item is said to be Pareto-optimal if there does not exist another item that is better or equal in all of the criteria. An item that is Pareto-optimal is optimal in the usual sense under some combination, not necessarily linear, of the criteria.
Hence, PDA is able to detect anomalies under multiple combinations of the criteria without explicitly forming these combinations.

The PDA approach involves creating dyads corresponding to dissimilarities between pairs of data samples under all of the dissimilarity measures. Sets of Pareto-optimal dyads, called Pareto fronts, are then computed. Nominal and anomalous samples are located near different depths of Pareto fronts; thus using the front depths of the dyads corresponding to a test sample is an effective way to discriminate between nominal and anomalous samples. The proposed PDA approach scales linearly in the number of criteria, which is a significant improvement compared to selecting multiple weights via a grid search, which scales exponentially in the number of criteria. Furthermore, we find that PDA outperforms any linear combination of criteria when used with several state-of-the-art anomaly detection algorithms in two experiments involving both synthetic and real data sets.

The rest of this paper is organized as follows. We first provide an introduction to Pareto fronts and discuss related work in Section 2. We then introduce the concept of a dyad and relate Pareto fronts of dyads to the anomaly detection problem in Section 3. The PDA anomaly detection algorithm is then described in Section 4. Finally, we present two experiments in Section 5 to evaluate the performance of PDA.

2 Background

2.1 Pareto fronts

The PDA method proposed in this paper utilizes the notion of Pareto optimality, which has been studied in many application areas in economics, computer science, and the social sciences, among others. We provide an introduction to Pareto optimality and define the notion of a Pareto front.

Consider the following problem: given n items, denoted by the set S, and K criteria for evaluating each item, denoted by functions f_1, ..., f_K, select x ∈ S that minimizes [f_1(x), ..., f_K(x)]. In most settings, it is not possible to identify a single item x that simultaneously minimizes f_i(x) for all i ∈ {1, ..., K}. In order to decrease criterion i, i.e. by finding x* such that f_i(x*) < f_i(x), it may be necessary to increase criterion j, i.e. f_j(x*) > f_j(x). A minimizer can be found by combining the K criteria using a linear combination of the f_i's and finding the minimum of the combination. Different choices of weights in the linear combination could result in different minimizers; a set of items that are minimizers under some linear combination can then be created by using a grid search over the weights, for example.

A more powerful approach involves finding the set of Pareto-optimal items. An item x is said to strictly dominate another item x* if x is no greater than x* in each criterion and x is less than x* in at least one criterion. This relation can be written as x ≻ x* if f_i(x) ≤ f_i(x*) for each i ∈ {1, ..., K} and f_i(x) < f_i(x*) for some i. The Pareto front, which corresponds to the set of Pareto-optimal items, is the set of items in S that are not strictly dominated by another item in S. It contains all of the minimizers that are found using linear combinations, but it also includes other items that cannot be found by linear combinations. Denote the Pareto front by F_1, which we call the first Pareto front. The second Pareto front can be constructed by finding items that are not strictly dominated by any of the remaining items, which are members of the set S \ F_1. More generally, define the ith Pareto front by

    F_i = Pareto front of the set S \ ∪_{j=1}^{i-1} F_j.

For convenience, we say that a Pareto front F_i is deeper than F_j if i > j.
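To make the notions of strict dominance and front peeling concrete, the following is a minimal Python sketch (our own illustration, not the authors' implementation; the function names are ours). It constructs successive Pareto fronts of a finite item set by repeatedly removing the non-dominated items, exactly as in the definition above.

    import numpy as np

    def strictly_dominates(a, b):
        # a strictly dominates b: no worse in every criterion, better in at least one
        return np.all(a <= b) and np.any(a < b)

    def pareto_fronts(scores):
        # Peel successive Pareto fronts from an (n, K) array of criteria values.
        # Returns a list of index lists, one per front (F_1, F_2, ...).
        # A simple O(n^2) sweep per front, adequate for small examples.
        remaining = list(range(len(scores)))
        fronts = []
        while remaining:
            front = [i for i in remaining
                     if not any(strictly_dominates(scores[j], scores[i])
                                for j in remaining if j != i)]
            fronts.append(front)
            remaining = [i for i in remaining if i not in front]
        return fronts

Applied to 50 samples drawn uniformly from [0,1] × [0,1] with f_1([x,y]) = x and f_2([x,y]) = y, pareto_fronts produces the onion-like layers illustrated in the toy example that follows; fronts[0] is the first Pareto front of Figure 1.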
We illustrate the concept of Pareto fronts on a toy example. We draw 50 samples from a uniform distribution on [0,1] × [0,1]. The ith sample is denoted by [x_i, y_i]. K = 2 criteria are used to evaluate each sample, where f_1([x,y]) = x and f_2([x,y]) = y. The first five Pareto fronts for one realization are shown in Figure 1. Notice that the fronts create layers of the data. Like an onion, the second layer is revealed by removing the first layer, the third layer is revealed by removing the second layer, and so on. As we show in Section 3, the depths of the Pareto fronts carry valuable information for anomaly detection. Notice also that linear combinations can only identify four of the seven samples on the first Pareto front, namely the convex portion of the front. This provides some intuition as to why the proposed method based on Pareto fronts is able to outperform methods using linear combinations.

Figure 1: First 5 Pareto fronts on a toy example of 50 samples uniformly distributed on [0,1] × [0,1]. The fronts create onion-like layers of the data. Samples on the first Pareto front, indicated by a red '*', are minimizers under some linear combination.

2.2 Related work

Several machine learning methods utilizing Pareto optimality have previously been proposed; an overview can be found in Jin and Sendhoff (2008). These methods typically formulate machine learning problems as difficult multi-objective optimization problems. The goal is then to obtain a set of Pareto-optimal solutions using evolutionary algorithms or other methods for solving these multi-objective optimization problems. These methods differ from our use of Pareto optimality in that they use only the first Pareto front, while we use multiple Pareto fronts. Furthermore, we consider Pareto fronts created from a finite set of items, so we do not need to employ sophisticated methods in order to generate these fronts.

Hero and Fleury (2004) introduced a method for gene ranking and filtering using Pareto fronts that is closely related to our approach. The method establishes a ranking of genes according to the estimated probability that each gene is Pareto-optimal. It differs from the method proposed in this paper in that the Pareto fronts are computed on the data samples themselves, i.e. the genes, rather than on dyads associated with the data samples. We introduce the concept of a dyad in Section 3.1.

Another related area is multi-view learning (Blum and Mitchell, 1998; Sindhwani et al., 2005), which involves learning from data where observations are represented by multiple sets of features, commonly referred to as "views". In such a case, training in one view helps to improve learning in another view. The problem of view disagreement, where samples take different classes in different views, has recently been investigated (Christoudias et al., 2008). The views are similar to criteria in our problem setting. However, in our setting, different criteria may be orthogonal and could even give contradictory information; hence there may be severe view disagreement. As a result, training in one view could actually worsen performance in another view, so the problem we consider differs from multi-view learning.

Finally, many other anomaly detection methods have previously been proposed. Hodge and Austin (2004) and Chandola et al. (2009) both provide extensive surveys of different anomaly detection methods and applications. Nearest neighbor-based methods are closely related to the proposed PDA approach.
Byers and Raftery (1998) proposed to use the distance between a sample and its kth-nearest neighbor as the anomaly score for the sample; similarly, Angiulli and Pizzuti (2002) and Eskin et al. (2002) proposed to use the sum of the distances between a sample and its k nearest neighbors. Breunig et al. (2000) used an anomaly score based on the local density of the k nearest neighbors of a sample. Hero (2006) introduced a non-parametric adaptive anomaly detection method using geometric entropy minimization, which is based on random k-point minimal spanning trees. Zhao and Saligrama (2009) proposed an anomaly detection algorithm, k-LPE, using local p-value estimation (LPE) based on a k-nearest neighbor graph. This algorithm is provably optimal for a specified level when the test samples are drawn from a mixture of the nominal density and a uniform density.

All of the aforementioned methods are designed for single-criterion anomaly detection. In the multi-criteria case, the single-criterion algorithms must be executed multiple times with different weights, unlike the PDA anomaly detection algorithm that we propose.

3 Pareto depth analysis

3.1 Problem description

Assume that a training set X_N = {X_1, ..., X_N} of d-dimensional vectors X_i is available. Given a new vector X, the objective of anomaly detection is to declare X to be an anomaly if X is significantly different from samples in X_N. Suppose that K > 1 different evaluation criteria are given. Denote these criteria by C_1, ..., C_K. Each criterion is associated with a specific measure for computing dissimilarities. Denote the dissimilarity between X_i and X_j computed using the measure corresponding to criterion C_k by d_k(i,j).

We define a dyad by D_ij = [d_1(i,j), ..., d_K(i,j)]^T ∈ R_+^K, i ∈ {1, ..., N}, j ∈ {1, ..., N} \ i. Each dyad D_ij corresponds to a connection between vectors X_i and X_j. Therefore, there are in total N(N−1)/2 different dyads. For convenience, denote the set of all dyads by D and the space of all dyads, R_+^K, by 𝔻. By the definition of strict dominance in Section 2.1, a dyad D strictly dominates another dyad D* if D(i) ≤ D*(i) for all i ∈ {1, ..., K} and D(i) < D*(i) for some i, where D(i) denotes the ith component of D. Recall that each dyad corresponds to a connection between two samples. Therefore, if D_ij ≻ D_ik, the dissimilarity between X_i and X_j is no greater than that between X_i and X_k under any combination of the K criteria. The first Pareto front F_1 corresponds to the set of dyads from D that are not strictly dominated by any other dyads from D. The second Pareto front F_2 corresponds to the set of dyads from D \ F_1 that are not strictly dominated by any other dyads from D \ F_1, and so on, as defined in Section 2.1. Recall that we refer to F_i as a deeper front than F_j if i > j.

3.2 Connection to anomaly detection

If a dyad D_ij is located on a shallow front, then the dissimilarity between the two samples X_i and X_j corresponding to D_ij is small under some combination of the K criteria. For each sample X_n, there are N−1 dyads corresponding to its connections with the other N−1 samples. Define the set of N−1 dyads associated with X_n by D_n. If most dyads in D_n are located on shallow Pareto fronts, then the dissimilarities between X_n and the other N−1 samples are small under certain combinations of the criteria. Thus, X_n is likely to be a nominal sample. This is the basic idea of the proposed anomaly detection method using PDA and will play an important role in the following section.
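As a concrete illustration of the dyads defined in Section 3.1, the sketch below (our own code, with hypothetical names) forms the N(N−1)/2 training dyads from a list of per-criterion dissimilarity functions; the training fronts F_1, ..., F_M used below can then be obtained by running the front-peeling sketch from Section 2.1 on the resulting array.

    import itertools
    import numpy as np

    def build_dyads(X, dissimilarities):
        # Form a dyad D_ij = [d_1(i,j), ..., d_K(i,j)] for every unordered pair
        # of training samples. X is an (N, d) array; `dissimilarities` is a list
        # of K functions mapping (X_i, X_j) to a nonnegative scalar.
        # Returns an (N(N-1)/2, K) array of dyads and the list of index pairs.
        pairs = list(itertools.combinations(range(len(X)), 2))
        dyads = np.array([[d(X[i], X[j]) for d in dissimilarities]
                          for (i, j) in pairs])
        return dyads, pairs

    # Example with K = 2 criteria: the absolute differences in the x and y coordinates.
    X_train = np.random.default_rng(1).normal(size=(40, 2))
    criteria = [lambda a, b: abs(a[0] - b[0]), lambda a, b: abs(a[1] - b[1])]
    dyads, pairs = build_dyads(X_train, criteria)  # dyads.shape == (780, 2)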
We construct Pareto fronts F_1, ..., F_M of the dyads from the training set, where the total number of fronts M is the number of fronts required for each dyad to be a member of a front. When a test sample X is obtained, we create new dyads corresponding to connections between X and training samples. Similar to many other anomaly detection methods, we connect each test sample to its k nearest neighbors. Since we are dealing with multiple criteria, we connect each test sample to its k nearest neighbors with respect to each criterion, and we do so using different values of k for each criterion, which we denote by k_1, ..., k_K. We delay the discussion of how to choose these k_i's to Section 4.2. We create s = Σ_{i=1}^{K} k_i new dyads, which we denote by the set D^new = {D_1^new, D_2^new, ..., D_s^new}, corresponding to the connections between X and the union of the k_i nearest neighbors in each criterion i.¹ In other words, we create a dyad between X and X_j if X_j is among the k_i nearest neighbors of X in any criterion i. We say that D_i^new is below a front F_l if

    D_i^new ≻ D for some D ∈ F_l,

i.e. D_i^new strictly dominates at least a single dyad in F_l. Define the depth of D_i^new by

    e_i = min{l | D_i^new is below F_l}.    (1)

¹If a training sample is one of the k_i nearest neighbors in m criteria, then m copies of the dyad corresponding to the connection between the test sample and the training sample are created.

Figure 2: Top: 40 training samples (blue circles) generated from a bivariate mixture density and two test samples ('*' and '+'). Left: All dyads for the training samples (blue dots) along with the first 20 Pareto fronts (green lines). Dyads associated with test point '*' (red circles) concentrate around shallow fronts. Right: Dyads associated with test point '+' (red circles) concentrate around deep fronts.

Therefore, if e_i is large, then D_i^new is around deep fronts, so the distance between X and the corresponding training sample is large under all combinations of the K criteria. If e_i is small, then D_i^new is around shallow fronts, so the distance between X and the corresponding training sample is small under some combination of the K criteria.

To give a concrete example, Figure 2 shows the positions of 40 training samples and 2 test samples in R^2 along with the first 20 Pareto fronts using two criteria, the absolute values of the differences in the x and y dimensions. For demonstrative purposes, the test samples are connected to all of the training samples, i.e. k_1 = k_2 = N. From Figure 2, the '+' is likely to be anomalous under any weighting of the two criteria while the '*' is likely to be nominal. Notice that the distributions of dyads in D associated with these two test samples are quite different. The difference in these distributions is used in PDA to detect anomalies through the creation of an anomaly score.
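The depth in Eq. (1) can be computed directly from its definition. The sketch below (our own illustration, reusing strictly_dominates from the earlier sketch) scans the fronts in order and returns the index of the first front that a new dyad is below; the paper's detection algorithm replaces this linear scan with binary searches (Section 4.3), and the value returned when a new dyad dominates no training dyad is an assumption on our part.

    def dyad_depth(new_dyad, fronts, train_dyads):
        # Depth e_i from Eq. (1): the smallest l such that new_dyad strictly
        # dominates at least one training dyad on front F_l.
        # `fronts` is a list of index lists into the array `train_dyads`,
        # e.g. as returned by pareto_fronts().
        for depth, front in enumerate(fronts, start=1):
            if any(strictly_dominates(new_dyad, train_dyads[j]) for j in front):
                return depth
        return len(fronts) + 1  # deeper than every front (our convention)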
4 Multi-criteria anomaly detection

Up to now, we have shown that the distribution of dyads in D with respect to the Pareto fronts can be very different for an anomalous sample compared to a nominal one. We formalize this idea to develop a multi-criteria anomaly detection algorithm.

4.1 Anomaly score

We first define an anomaly score for each test sample. We start with the observation from Figure 2 that dyads corresponding to a nominal test sample are typically located near shallower fronts than dyads corresponding to an anomalous test sample. Therefore, the cumulative distribution function (cdf) of e_i corresponding to an anomalous sample should be quite different from that of e_i corresponding to a nominal sample. Given D^new, let z(i) denote the fraction of dyads in D^new below F_i, for i = 1, ..., M. That is,

    z(i) = |{D_j^new ∈ D^new | D_j^new is below F_i}| / |D^new|.

From (1), it can be seen that z is the cdf of the e_i's for a particular test sample. A plot of z as a function of front depth for the two test samples in Figure 2 is shown in Figure 3. Notice that the two cdf's do indeed differ significantly. The cdf for the anomalous sample '+' grows more slowly as a function of the front depth because the dyads corresponding to the '+' are around deeper fronts as compared to the dyads corresponding to the '*'.

For each test sample X, we define an anomaly score v(X) to capture the information contained in the cdf corresponding to X. We choose the anomaly score to be the area above the cdf of e_i corresponding to X. That is,

    v(X) = Σ_{i=1}^{M} (1 − z(i)).

A benefit of this choice of anomaly score is that it can be calculated without first computing the cdf of e_i, by noting that it is simply the mean of the e_i's. This is easily shown using the identity E[e_i] = Σ_{j=1}^{∞} Pr(e_i ≥ j). Thus the anomaly score can be easily computed and compared to the decision threshold using the test

    v(X) = (1/s) Σ_{i=1}^{s} e_i ≷ σ.

X is declared to be an anomaly if v(X) > σ, where σ is the decision threshold.

Figure 3: Cumulative distribution functions (cdf's) of e_i's corresponding to test samples '*' and '+' from Figure 2. The cdf for the '+' grows more slowly because the dyads corresponding to the '+' are around deeper fronts.

4.2 Selection of parameters

Recall from Section 3.2 that we connect each test sample X to a training sample X_j if X_j is one of the k_i nearest neighbors of X in terms of the dissimilarity measure defined by criterion i, for any i ∈ {1, ..., K}. We now discuss how these parameters k_1, ..., k_K can be selected. For simplicity, first assume that there is only one criterion, so that a single parameter k is to be selected. PDA is able to detect an anomaly if the distribution of its dyads with respect to the Pareto fronts differs from that of a nominal sample. More formally, the area above the cdf of e_i corresponding to an anomalous sample must be higher than that of a nominal sample. If k is chosen too small, this may not be the case, especially if there are training samples present near an anomalous sample, in which case the dyads corresponding to the anomalous sample may reside near shallow fronts much like those of a nominal sample. On the other hand, if k is chosen too large, many dyads may correspond to connections to training samples that are far away, even if the test sample is nominal, which also makes the cdf's of nominal and anomalous samples more similar.

We propose to use the properties of k-nearest neighbor graphs (kNNGs) constructed on the training samples to select the number of training samples to connect to each test sample. We construct symmetric kNNGs, i.e. we connect samples i and j if i is one of the k nearest neighbors of j or j is one of the k nearest neighbors of i. We begin with k = 1 and increase k until the kNNG of the training samples is connected, i.e. there is only a single connected component. By forcing the kNNG to be connected, we ensure that there are no isolated regions of training samples. Such isolated regions could lead to dyads corresponding to anomalous samples residing near shallow fronts like nominal samples, which is undesirable. By keeping k small while retaining a connected kNNG, we try to avoid the problem of having so many dyads that even a nominal sample may have many dyads located near deep fronts.
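A minimal sketch of this connectivity heuristic follows, assuming SciPy's sparse-graph utilities for the connected-component check (our own illustration of the procedure described above, not the authors' code).

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    def choose_k(dist):
        # Smallest k such that the symmetric kNN graph of the training samples
        # (under one dissimilarity measure) has a single connected component.
        # `dist` is an (N, N) matrix of pairwise dissimilarities with zero diagonal.
        N = dist.shape[0]
        order = np.argsort(dist, axis=1)  # column 0 is each sample itself
        for k in range(1, N):
            rows = np.repeat(np.arange(N), k)
            cols = order[:, 1:k + 1].ravel()  # k nearest neighbors of each sample
            adj = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(N, N))
            adj = adj + adj.T  # symmetric: connect i and j if either lists the other
            n_components, _ = connected_components(adj, directed=False)
            if n_components == 1:
                return k
        return N - 1

In the multi-criteria setting discussed next, this procedure would be run once per criterion on the corresponding dissimilarity matrix to obtain k_1, ..., k_K.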
This method of choosing k to retain connectivity has been used as a heuristic in other machine learning problems, such as spectral clustering (von Luxburg, 2007). Note that by requiring the kNNG to be connected, we are effectively imposing a one-class assumption. If the nominal distribution consists of multiple well-separated classes, such an approach may not work well.

Now let us return to the situation PDA was designed for, with K different criteria. For each criterion i, we construct a k_iNNG using the corresponding dissimilarity measure and increase k_i until the k_iNNG is connected. We then connect each test sample to s = Σ_{i=1}^{K} k_i training samples. Note that we choose each k_i independently of the other criteria, which is likely suboptimal. In principle, an approach that chooses the k_i's jointly could perform better; however, such an approach would add to the complexity. Note also that we choose separate k_i's for each criterion, which we find is necessary to obtain good performance when the different dissimilarities have varying scales and properties. There are, however, pathological examples where the independent approach could choose the k_i's poorly, such as the well-known example of two moons. These examples typically involve multiple classes, which break the one-class assumption we effectively impose by requiring a connected k_iNNG. How to choose the k_i's in a multi-class environment is beyond the scope of this paper and is an area for future work.

4.3 Algorithm

Pseudocode for training the PDA anomaly detector is shown in Algorithm 1. It involves creating N(N−1)/2 dyads corresponding to all pairs of training samples, then selecting the values of the parameters k_1, ..., k_K that determine how many training samples each test sample is connected to. Computing all pairwise dissimilarities in each criterion requires O(dKN^2) floating-point operations (flops), where d denotes the number of dimensions involved in computing a dissimilarity. Construction of the Pareto fronts is done using non-dominated sorting. We use the fast non-dominated sort of Deb et al. (2000), which constructs all of the Pareto fronts using O(KN^4) comparisons in the worst case, where each front consists of a single dyad.

Pseudocode for detecting whether a test sample is anomalous is shown in Algorithm 2. It involves creating dyads between the test sample and the k_l nearest training samples in criterion l, which requires O(dKN) flops. For each dyad D_i^new, we need to calculate the depth e_i. This involves comparing the test dyad with training dyads on multiple fronts until we find a training dyad that is dominated by the test dyad; e_i is the index of the front that this training dyad is a part of. Using a binary search to select the front and another binary search to select the training dyads within the front to compare to, we need to make O(K log^2 N) comparisons in the worst case to compute e_i. The anomaly score is computed by taking the mean of the s e_i's corresponding to the test sample; the score is then compared against a threshold σ to determine whether the sample is anomalous.
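To tie the steps of this section together, the following is a rough end-to-end sketch of the detection phase. It follows the description above rather than the paper's Algorithm 2: it uses a linear scan over the fronts instead of the binary searches, adopts our own convention for dyads that dominate no training dyad, and all function and variable names are ours.

    import numpy as np

    def strictly_dominates(a, b):
        return np.all(a <= b) and np.any(a < b)

    def score_test_sample(x, X_train, dissimilarities, ks, fronts, train_dyads):
        # Anomaly score v(x): mean depth of the new dyads between x and the
        # k_i nearest training neighbors under each criterion i.
        depths = []
        for d, k in zip(dissimilarities, ks):
            dists = np.array([d(x, xt) for xt in X_train])
            for j in np.argsort(dists)[:k]:  # k_i nearest neighbors in criterion i
                new_dyad = np.array([dm(x, X_train[j]) for dm in dissimilarities])
                depth = next((l for l, front in enumerate(fronts, start=1)
                              if any(strictly_dominates(new_dyad, train_dyads[m])
                                     for m in front)),
                             len(fronts) + 1)  # deeper than every front (our convention)
                depths.append(depth)
        return np.mean(depths)  # declare x anomalous if the score exceeds sigma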