Linear-Complexity Exponentially-Consistent Tests for Universal Outlying Sequence Detection

Yuheng Bu, Shaofeng Zou, Venugopal V. Veeravalli
University of Illinois at Urbana-Champaign
Email: [email protected], [email protected], [email protected]

February 8, 2017

arXiv:1701.06084v3 [cs.IT]

Abstract

We study a universal outlying sequence detection problem, in which there are M sequences of samples out of which a small subset of outliers need to be detected. A sequence is considered an outlier if the observations therein are generated by a distribution different from those generating the observations in the majority of the sequences. In the universal setting, the goal is to identify all the outliers without any knowledge about the underlying generating distributions. In prior work, this problem was studied as a universal hypothesis testing problem, and a generalized likelihood (GL) test was constructed and its asymptotic performance characterized. In this paper, we propose a different class of tests for this problem based on distribution clustering. Such tests are shown to be exponentially consistent, and their time complexity is linear in the total number of sequences, in contrast with the GL test, which has time complexity that is exponential in the number of outliers. Furthermore, our tests based on clustering are applicable to more general scenarios. For example, when both the typical and outlier distributions form clusters, the clustering based test is exponentially consistent, but the GL test is not even applicable.

1 Introduction

In this paper, we study a universal outlying sequence detection problem, where there are M sequences of samples out of which a small subset are outliers and need to be detected. Each sequence consists of independent and identically distributed (i.i.d.) discrete observations. It is assumed that the observations in the majority of the sequences are distributed according to typical distributions.
A sequence is considered an outlier if its distribution is different from the typical distributions. We are interested in the universal setting of the problem, where nothing is known about the outlier and typical distributions, except that the outlier distributions are different from the typical distributions. The goal is to design a test, which does not depend on the typical and outlier distributions, to best discern all the outliers.

Outlying sequence detection finds possible applications in many domains (e.g., [1-3]). The problem of outlying sequence detection was studied as a universal outlier hypothesis testing problem for discrete observations in [4] and continuous observations in [5]. In [4], the exponential consistency of the generalized likelihood (GL) test under various universal settings was established. When there are a known number of identically distributed outliers, the GL test was shown to be asymptotically optimal as M goes to infinity. However, the high time complexity, which is exponential in the number of outliers T, is a major drawback of the GL test when M and T are large. In this paper, we propose tests based on distribution clustering [6] for various scenarios. Such tests are shown to be exponentially consistent, with time complexity that is linear in M and independent of T.

The intuition for the clustering based test is that the typical distributions are usually closer to each other than they are to the outlier distributions. Therefore, the typical distributions (and also possibly the outlier distributions) will naturally form a cluster. This implies that the outlying sequence detection problem is equivalent to clustering the distributions using KL divergence as the distance metric (see also [7]). We note that our problem is different from the classical distribution clustering problem [6, 8-10].
In the distribution clustering problem, the goal is to find the cluster structure with the lowest cost (the sum of the distances from each point in a cluster to its center). However, for our problem, we are given a sequence of samples from each distribution rather than the actual underlying distribution itself. Therefore, we are interested in the statistical performance, i.e., the error probability, of our test. Previous studies on approximation algorithms for distribution clustering [8-10] only show that the cost corresponding to the cluster structure returned by the approximation is within a log K factor of the minimal cost, where K is the number of clusters, and there are no results showing that the approximation algorithms converge to the minimal cost. Therefore, their results cannot be directly applied to our problem to provide a performance guarantee in a probabilistic sense.

Our contributions in this paper are as follows: (1) in all cases where the GL test is exponentially consistent, we construct tests based on clustering that are also exponentially consistent and have time complexity linear in M; (2) we show that the tests based on clustering are applicable to more general scenarios. For example, when both the typical and outlier distributions form clusters, the clustering based test is exponentially consistent, but the GL test is not even applicable.

The rest of the paper is organized as follows. In Section 2, we describe the problem model. In Section 3, we introduce the GL test studied in [4]. In Section 4, we reformulate the outlying sequence detection problem as a distribution clustering problem. In Section 5, we propose linear-complexity tests based on the K-means clustering algorithm. In Section 6, we provide some numerical results.

2 Problem Model

Throughout the paper, all distributions are defined on the finite set Y, and P(Y) denotes the set of all probability mass functions (pmfs) on Y. All logarithms are with respect to the natural base.
We consider an outlying sequence detection problem, where there are in total M ≥ 3 data sequences, denoted by Y^(i) for i = 1, ..., M. Each data sequence Y^(i) consists of n i.i.d. samples Y^(i)_1, ..., Y^(i)_n. The majority of the sequences are distributed according to typical distributions, except for a subset S of outlying sequences, where S ⊂ {1, ..., M} and 1 ≤ |S| = T < M/2. Each typical sequence j is distributed according to a typical distribution π_j ∈ P(Y), j ∉ S. Each outlying sequence i is distributed according to an outlier distribution μ_i ∈ P(Y), i ∈ S. Nothing is known about μ_i and π_j except that μ_i ≠ π_j, ∀ i ∈ S, j ∉ S, S ⊂ {1, ..., M}, and all of them have full support over the finite alphabet Y.

We first study the case in which all typical distributions are identical, i.e., π_j = π, ∀ j ∉ S.

We then study the case where both the outlier distributions {μ_i}_{i∈S} and the typical distributions {π_j}_{j∉S} are distinct. Moreover, the typical distributions and the outlier distributions form two clusters. More specifically,

    max_{i,j∈S} D(μ_i‖μ_j) < min_{i∈S, j∉S} {D(μ_i‖π_j), D(π_j‖μ_i)},
    max_{i,j∉S} D(π_i‖π_j) < min_{i∈S, j∉S} {D(μ_i‖π_j), D(π_j‖μ_i)}.     (1)

This condition means that the divergence within the same cluster is less than the divergence between different clusters.

We use the notation y^(i) = (y^(i)_1, ..., y^(i)_n), where y^(i)_k ∈ Y denotes the k-th observation of the i-th sequence. Let 𝒮 be the set comprising all possible outlier subsets. For the hypothesis corresponding to an outlier subset S ∈ 𝒮, the joint distribution of all the observations is given by

    p_S(y^{Mn}) = L_S(y^{Mn}, {μ_i}_{i∈S}, {π_j}_{j∉S})
               = ∏_{k=1}^{n} { ∏_{i∈S} μ_i(y^(i)_k) ∏_{j∉S} π_j(y^(j)_k) },     (2)

where L_S(y^{Mn}, {μ_i}_{i∈S}, {π_j}_{j∉S}) denotes the likelihood, which is a function of the observations y^{Mn}, {μ_i}_{i=1}^M and {π_j}_{j=1}^M.

Our goal is to build distribution-free tests to detect the outlying sequences.
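As a concrete illustration of this observation model, the following sketch generates M sequences of n i.i.d. samples, T of which are outliers, and forms their empirical pmfs. The alphabet size, the uniform choice of π, and the particular outlier pmf μ are our own illustrative assumptions, not taken from the paper.

```python
import numpy as np

def generate_sequences(M, T, n, pi, mu, rng):
    """Generate M length-n i.i.d. sequences on Y = {0, ..., |Y|-1}.
    The first T sequences are outliers drawn from mu; the rest from pi."""
    Y = len(pi)
    return np.stack([rng.choice(Y, size=n, p=(mu if i < T else pi))
                     for i in range(M)])

def empirical(seq, Y):
    """Empirical pmf gamma_i of one sequence of samples."""
    return np.bincount(seq, minlength=Y) / len(seq)

rng = np.random.default_rng(0)
pi = np.full(10, 0.1)                    # uniform typical distribution
mu = np.array([0.19] + [0.09] * 9)       # a hypothetical outlier pmf
seqs = generate_sequences(M=20, T=3, n=500, pi=pi, mu=mu, rng=rng)
gammas = np.stack([empirical(s, 10) for s in seqs])
```

Each row of `gammas` sums to one; in the universal setting, these empirical pmfs are the only inputs a test is allowed to use.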
The test can be captured by a universal rule δ : Y^{Mn} → 𝒮, which must not depend on {μ_i}_{i=1}^M and {π_j}_{j=1}^M. The performance of a universal test is gauged by the maximal probability of error, which is defined as

    e(δ, {μ_i}_{i=1}^M, {π_j}_{j=1}^M) ≜ max_{S∈𝒮} Σ_{y^{Mn}: δ(y^{Mn})≠S} p_S(y^{Mn}),

and the corresponding error exponent is defined as

    α(δ, {μ_i}_{i=1}^M, {π_j}_{j=1}^M) ≜ lim_{n→∞} −(1/n) log e(δ, {μ_i}_{i=1}^M, {π_j}_{j=1}^M).

A universal test δ is termed universally exponentially consistent if

    α(δ, {μ_i}_{i=1}^M, {π_j}_{j=1}^M) > 0,

for μ_i ≠ π_j, i, j = 1, ..., M.

3 Generalized Likelihood Test

In this section, we consider the case where the typical distributions are identical, i.e., π_j = π, ∀ j ∉ S. Let γ_i denote the empirical distribution of y^(i), defined as γ_i(y) ≜ (1/n) |{k = 1, ..., n : y^(i)_k = y}|, for each y ∈ Y.

In the universal setting with π and {μ_i}_{i∈S} unknown, conditioned on the set of outliers being S ∈ 𝒮, we compute the generalized likelihood of y^{Mn} by replacing π and {μ_i}_{i∈S} in (2) with their maximum likelihood estimates (MLEs) {μ̂_i}_{i∈S} and π̂_S, as

    p̂_S^univ = L̂_S(y^{Mn}, {μ̂_i}_{i∈S}, π̂_S).     (3)

The GL test then selects the hypothesis under which the GL is maximized (ties are broken arbitrarily), i.e.,

    δ_GL(y^{Mn}) = argmax_{S∈𝒮} p̂_S^univ.     (4)

3.1 Known Number of Outliers

We first consider the case in which the number of outliers, denoted by T ≥ 1, is known at the outset, i.e., 𝒮 = {S : S ⊂ {1, ..., M}, |S| = T}. We compute the generalized likelihood of y^{Mn} by replacing the μ_i, i ∈ S, and π in (2) with their MLEs: μ̂_i ≜ γ_i, and π̂_S ≜ (Σ_{j∉S} γ_j)/(M−T). Then the GL test in (4) is equivalent to

    δ_GL(y^{Mn}) = argmin_{S∈𝒮} Σ_{j∉S} D(γ_j ‖ (Σ_{j'∉S} γ_{j'})/(M−T)),     (5)

where D(p‖q) denotes the KL divergence between two distributions p, q ∈ P(Y), defined as

    D(p‖q) ≜ Σ_{y∈Y} p(y) log(p(y)/q(y)).

Proposition 1. [4, Theorems 9 and 10] When the number of outliers is known, the GL test in (5) is universally exponentially consistent. As M → ∞, the achievable error exponent converges as

    lim_{M→∞} α(δ_GL, {μ_i}_{i=1}^M, π) = lim_{M→∞} min_{i=1,...,M} 2B(μ_i, π),

where B(p, q) denotes the Bhattacharyya distance between two distributions p, q ∈ P(Y), defined as

    B(p, q) ≜ −log( Σ_{y∈Y} p(y)^{1/2} q(y)^{1/2} ).

When all outlier sequences are identically distributed, i.e., μ_i = μ ≠ π, i = 1, ..., M, the achievable error exponent of the GL test in (5) converges to the optimal one achievable when both μ and π are known.

Note that the number of hypotheses in the test (5) is (M choose T). Thus, exhaustive search over all possible hypotheses has time complexity that is polynomial in M and exponential in T.

3.2 Unknown Number of Identical Outliers

In this subsection, we consider the case where the number of outliers is unknown, i.e., 𝒮 = {S : S ⊂ {1, ..., M}, 1 ≤ |S| < M/2}. Moreover, we assume that the outliers are identically distributed. By replacing the μ_i, i ∈ S, and π in (2) with their MLEs μ̂_i = μ̂_S ≜ (Σ_{i∈S} γ_i)/|S| and π̂_S ≜ (Σ_{j∉S} γ_j)/(M−|S|), the GL test in (4) is equivalent to

    δ_GL(y^{Mn}) = argmin_{S∈𝒮} Σ_{j∉S} D(γ_j ‖ (Σ_{j'∉S} γ_{j'})/(M−|S|)) + Σ_{i∈S} D(γ_i ‖ (Σ_{i'∈S} γ_{i'})/|S|).     (6)

Proposition 2. [4, Theorem 11] When the number of outliers is unknown, and 1 ≤ |S| < M/2, the GL test in (6) is universally exponentially consistent.

Note that the number of hypotheses in the GL test (6) is Σ_{i=1}^{⌊M/2⌋} (M choose i), which is exponential in M. The complexity of the test in (6) is even larger than that of the test in (5). We also note that when the number of outliers is unknown and the outliers can be distinctly distributed, there cannot exist a universally exponentially consistent test [4, Theorem 12].
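As an illustration of the complexity discussed above, the GL test in (5) can be implemented by exhaustive search over all (M choose T) subsets. The sketch below uses hypothetical helper names; the pmfs must have full support so the KL divergences are finite.

```python
import numpy as np
from itertools import combinations

def kl(p, q):
    """KL divergence D(p || q) for pmfs with full support."""
    return float(np.sum(p * np.log(p / q)))

def gl_test_known_T(gammas, T):
    """GL test (5): over every subset S with |S| = T, score the total KL
    divergence of the remaining empirical pmfs to their average, and
    return the subset with the smallest score."""
    M = len(gammas)
    best_S, best_cost = None, np.inf
    for S in combinations(range(M), T):
        rest = [j for j in range(M) if j not in S]
        pi_hat = np.mean([gammas[j] for j in rest], axis=0)
        cost = sum(kl(gammas[j], pi_hat) for j in rest)
        if cost < best_cost:
            best_S, best_cost = set(S), cost
    return best_S

# Toy check: four near-typical pmfs and two clearly outlying ones.
typ = np.array([0.25, 0.25, 0.25, 0.25])
out = np.array([0.70, 0.10, 0.10, 0.10])
gammas = np.stack([out, out, typ, typ, typ, typ])
print(gl_test_known_T(gammas, T=2))   # expected: {0, 1}
```

Adding the outlier-cluster term from (6) gives the unknown-T variant, at the price of scoring exponentially many subsets; that is exactly the cost the clustering based tests avoid.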
4 Problem Reformulation as Distribution Clustering

In order to construct low time complexity algorithms without sacrificing much in performance, we reformulate the outlying sequence detection problem as a distribution clustering problem. Although the distribution clustering problem is known to be NP-hard [8], there exist many approximation approaches, e.g., the K-means algorithm [11], with time complexity that is linear in M and linear in the number of clusters.

The typical distributions are closer to each other than to any of the outlier distributions, and the same also holds for the outlier distributions when the outliers form a cluster. The outliers can then be identified by clustering the empirical distributions into two clusters, where the cluster with more members contains all typical sequences, and the other cluster contains the outliers.

More specifically, we define the following cost function for distribution clustering:

    TC = Σ_{k=1}^{K} Σ_{i∈C_k} D(γ_i ‖ c_k),     (7)

where K is the number of clusters, c = {c_1, ..., c_K} are the cluster centers, and the disjoint partitions C = {C_1, ..., C_K} denote the cluster assignment. As shown in [6, Proposition 1], for a given cluster assignment {C_k}_{k=1}^K, the total cost is minimized when c_k = (Σ_{i∈C_k} p_i)/|C_k|, which is the average of the distributions within the k-th cluster.

The GL test in (6) can then be interpreted as a distribution clustering algorithm with K = 2. The first term in (6) corresponds to the cost in the typical cluster and the second term is the cost within the outlier cluster. The GL test decides on the cluster assignment that minimizes the cost among all possible cluster assignments. For the case where the typical sequences are identically distributed but the outliers are not, it suffices to cluster only the empirical distributions of the typical sequences, as in the GL test (5).

Thus, both the GL tests in (5) and (6) are equivalent to empirical distribution clustering on the probability simplex using KL divergence as the distance metric.
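A minimal K-means-style iteration of this kind on the probability simplex, using D(γ_i‖c_k) as the distance and the within-cluster average as the center update ([6, Proposition 1]), might look as follows. This is our own sketch, not the paper's code.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def kmeans_pmfs(pmfs, centers, max_iter=100):
    """K-means on pmfs: assign each pmf to the center minimizing KL
    divergence, then re-estimate each center as its cluster average."""
    pmfs = np.asarray(pmfs)
    centers = [np.asarray(c, dtype=float) for c in centers]
    assign = None
    for _ in range(max_iter):
        new_assign = [int(np.argmin([kl(p, c) for c in centers]))
                      for p in pmfs]
        if new_assign == assign:          # partition unchanged: converged
            break
        assign = new_assign
        for k in range(len(centers)):     # re-estimation step
            members = [p for p, a in zip(pmfs, assign) if a == k]
            if members:
                centers[k] = np.mean(members, axis=0)
    clusters = [[i for i, a in enumerate(assign) if a == k]
                for k in range(len(centers))]
    return clusters, centers

typ = np.array([0.25, 0.25, 0.25, 0.25])
out = np.array([0.60, 0.20, 0.10, 0.10])
pmfs = [typ, typ, typ, typ, out, out]
clusters, _ = kmeans_pmfs(pmfs, centers=[typ, out])
print(clusters)   # expected: [[0, 1, 2, 3], [4, 5]]
```

Each sweep costs O(MK) divergence evaluations, which is the linear-in-M behavior the clustering based tests inherit; the cost (7) is non-increasing over sweeps, but only a local optimum is guaranteed.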
While the distribution clustering problem itself is known to be NP-hard [8], there are many existing approximation algorithms with low complexity [11]. Here, we introduce the K-means distribution clustering algorithm from [6].

Algorithm 1 K-means distribution clustering algorithm
Input: M distributions p_1, ..., p_M defined on Y, number of clusters K.
Output: partition set {C_k}_{k=1}^K.
Initialization: {c_k}_{k=1}^K (specified in Algorithms 2 and 3)
Method:
  while not converged do
    {Assignment Step}
    Set C_k ← ∅, 1 ≤ k ≤ K
    for i = 1 to M do
      C_k ← C_k ∪ {p_i}, where k = argmin_{k'} D(p_i ‖ c_{k'})
    end for
    {Re-estimation Step}
    for k = 1 to K do
      c_k ← (Σ_{i∈C_k} p_i)/|C_k|
    end for
  end while
  Return {C_k}_{k=1}^K

For Algorithm 1, [6] only shows that the cost function in (7) is monotonically decreasing and that the algorithm terminates in a finite number of steps at a partition that is locally optimal, i.e., the total cost cannot be decreased by either (a) the assignment step or (b) changing the means of any existing clusters.

5 Clustering Based Tests

In this section, we propose linear-complexity tests based on the K-means clustering algorithm. We show that in all cases where the GL test is exponentially consistent, the clustering based tests using KL divergence as the distance metric are also exponentially consistent, while only taking time linear in M. For the case where the typical and outlier distributions form two clusters, we show that the clustering based test is exponentially consistent, but the GL test is not even applicable.

5.1 Known Number of Outliers

We first consider the case where the number of outliers T is known and the typical distributions are identical. Algorithm 1 cannot be directly applied here, because the outlier distributions may not form a cluster and Algorithm 1 does not employ the knowledge of T. Motivated by the test in (5), we design Algorithm 2. The novelty of this algorithm lies in the construction of the first clustering center for the typical distribution and an iterative approach based on K-means to update it.
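This construct-then-update idea can be sketched as follows: initialize π̂ from the ⌈M/2⌉-th smallest divergence to an arbitrarily chosen γ^(0) (taking "element" in ascending order is our interpretation), then alternate between picking the T pmfs farthest from π̂ and re-averaging the rest. This is a hypothetical implementation, not the paper's code.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def detect_known_T(gammas, T, max_iter=20):
    """Known-T clustering test (sketch). The assignment step only needs
    the T largest divergences, so each iteration is cheap."""
    gammas = np.asarray(gammas)
    M = len(gammas)
    g0 = gammas[0]                         # arbitrary gamma^(0)
    d0 = np.array([kl(g, g0) for g in gammas])
    # Initialize pi_hat with the ceil(M/2)-th smallest divergence to g0:
    # a majority of sequences are typical, so this pmf is typical w.h.p.
    pi_hat = gammas[np.argsort(d0)[int(np.ceil(M / 2)) - 1]]
    S = None
    for _ in range(max_iter):
        d = np.array([kl(g, pi_hat) for g in gammas])
        new_S = set(np.argsort(d)[-T:].tolist())   # T farthest pmfs
        if new_S == S:
            break
        S = new_S
        rest = [j for j in range(M) if j not in S]
        pi_hat = gammas[rest].mean(axis=0)         # re-estimation
    return S

typ = np.array([0.25, 0.25, 0.25, 0.25])
out1 = np.array([0.70, 0.10, 0.10, 0.10])
out2 = np.array([0.60, 0.20, 0.10, 0.10])
gammas = np.stack([typ] * 6 + [out1, out2])
print(detect_known_T(gammas, T=2))   # expected: {6, 7}
```

Selecting the T largest divergences can be done in O(M) with a selection algorithm [12]; the full sort above is only for brevity.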
Algorithm 2 Clustering with known number of outliers
Input: γ_1, ..., γ_M, number of outliers T.
Output: A set of outliers S.
Initialization:
  Choose one distribution γ^(0) from γ_1, ..., γ_M arbitrarily
  for i = 1 to M do
    Compute D(γ_i ‖ γ^(0))
  end for
  π̂ ← γ*, where D(γ* ‖ γ^(0)) is the ⌈M/2⌉-th element among all D(γ_i ‖ γ^(0))
Method:
  while not converged do
    {Assignment Step}
    Set S ← S*, where S* = argmax_{S'∈𝒮, |S'|=T} Σ_{i∈S'} D(γ_i ‖ π̂)
    {Re-estimation Step}
    π̂ ← (Σ_{j∉S} γ_j)/(M−T)
  end while
  Return S

Using the initialization in Algorithm 2, γ* is generated from π with high probability. The intuition behind this is as follows: if γ^(0) is generated from the typical distribution π, then only |S| < M/2 empirical distributions, which are generated from the μ_i, are far from γ^(0); if γ^(0) is generated from some μ_i, then there are at least M − |S| > M/2 of the D(γ_i ‖ γ^(0)) concentrating at D(π‖μ_i). Thus the ⌈M/2⌉-th element among the D(γ_i ‖ γ^(0)) is close to D(π‖μ_i), and γ* is generated from π with high probability.

Let δ_c2 denote the test described in Algorithm 2, and δ_c2^(ℓ) denote the test that runs ℓ iterations of Algorithm 2, where ℓ = 1, 2, ... is the number of iterations.

In the following theorem, we show that the test δ_c2^(1), with only one iteration step, is universally exponentially consistent.

Theorem 1. For each M ≥ 3, when the number of outliers T is known, the test δ_c2^(1), which runs one K-means iteration in Algorithm 2, is universally exponentially consistent. The achievable error exponent of δ_c2^(1) can be upper bounded as

    α(δ_c2^(1), {μ_i}_{i=1}^M, π) < lim_{M→∞} min_{i=1,...,M} 2B(μ_i, π).     (8)

Furthermore, the time complexity of the test δ_c2^(1) is O(M).

Proof sketch. Errors made by δ_c2^(1) in the initialization step can be decomposed into two scenarios. If γ^(0) is generated from the typical distribution π, an error occurs when π̂ is actually generated from an outlier distribution.
The probability of this event can be upper bounded by the probability of the event E_1 = {D(γ_i ‖ γ_{j1}) < D(γ_{j2} ‖ γ_{j1}), ∃ i ∈ S, ∃ j_1, j_2 ∉ S}. If γ^(0) is generated from an outlier distribution, the error probability can be upper bounded by the probability of the event E_2 = {D(γ_{j1} ‖ γ_{i1}) < D(γ_{i2} ‖ γ_{i1}) < D(γ_{j2} ‖ γ_{i1}), ∃ i_1, i_2 ∈ S, ∃ j_1, j_2 ∉ S}. By Sanov's theorem, we can prove that the probabilities of E_1 and E_2 decay exponentially fast.

The error probability in the assignment step can be upper bounded by the probability of the same event E_1, which decays exponentially fast by Sanov's theorem.

Moreover, the assignment step in Algorithm 2 can be solved in linear time O(M) [12], which is independent of T. The details of the proof can be found in Appendix B.

Comparison of Proposition 1 and Theorem 1 shows that δ_c2^(1) has a smaller error exponent than that of the GL test in (5) as M → ∞, but has linear time complexity. In the following theorem, we further show that the performance of Algorithm 2 improves with more iterations.

Theorem 2. For each M ≥ 3, when the number of outliers T is known, the test δ_c2^(ℓ) is universally exponentially consistent. As M → ∞, the achievable error exponent of δ_c2^(ℓ) in Algorithm 2 can be lower bounded by

    lim_{M→∞} α(δ_c2^(ℓ), {μ_i}_{i=1}^M, π) ≥ α(δ_c2^(1), {μ_i}_{i=1}^M, π).     (9)

Furthermore, the time complexity of the test δ_c2^(ℓ) is O(Mℓ).

Proof sketch. As shown in Theorem 1 and Proposition 1, both the test δ_c2^(1) and the GL test δ_GL are exponentially consistent, and the error exponent of δ_GL is larger than that of δ_c2^(1) as M → ∞.
Since P(δ_c2^(1) ≠ S) ≤ P(δ_c2^(1) ≠ S or δ_GL ≠ S) ≤ P(δ_c2^(1) ≠ S) + P(δ_GL ≠ S), we can conclude that P(δ_c2^(1) ≠ S or δ_GL ≠ S) also decays exponentially fast with the same error exponent as δ_c2^(1), which means that δ_GL and δ_c2^(1) both output the same correct S with high probability. Given that δ_c2^(1) and δ_GL have the same outcome, i.e., the initialization achieves the global optimum of the cost function, running ℓ steps will not change the output of δ_c2^(1). Thus, δ_c2^(ℓ) at least achieves the error exponent of δ_c2^(1). The details of the proof can be found in Appendix C.

5.2 Unknown Number of Identical Outliers

In this subsection, we consider the case where the number of outliers is unknown and both the typical and outlier distributions are identical. Motivated by the test in (6), we design the following initialization algorithm to set the clustering centers in Algorithm 1.

Algorithm 3 Clustering with unknown number of outliers
Input: M empirical distributions γ_1, ..., γ_M defined on the finite alphabet Y.
Output: A set of outliers S.
Initialization:
  Choose one distribution γ^(0) arbitrarily
  c_1 ← argmax_{γ_i} D(γ_i ‖ γ^(0))
  c_2 ← γ^(0)
Method: Same as in Algorithm 1 with K = 2
Return C_1 and C_2

It can be seen that c_1 and c_2 chosen by the initialization step in Algorithm 3 are generated by π and μ, with high probability.

Let δ_c3 denote the test described in Algorithm 3, and δ_c3^(ℓ) denote the test that runs ℓ iterations of Algorithm 3, where ℓ = 1, 2, ... is the number of iterations. The following theorem shows that the clustering based test δ_c3^(ℓ) is universally exponentially consistent, and has time complexity linear in M.

Theorem 3. For each M ≥ 3, when the number of outliers is unknown, the test δ_c3^(ℓ), which runs ℓ steps of Algorithm 3, is exponentially consistent, and has time complexity O(Mℓ).

Proof sketch.
The exponential consistency of δ_c3^(ℓ) can be established using techniques similar to those in Theorems 1 and 2. The major difference between the proofs of Theorem 1 and Theorem 3 is that there are two clustering centers in the initialization and assignment steps of Algorithm 3. The details of the proof can be found in Appendix D.

5.3 Typical and Outlier Distributions Forming Clusters

In this subsection, we consider the case where both the outlier distributions {μ_i}_{i∈S} and the typical distributions {π_j}_{j∉S} are distinct. Moreover, the typical distributions and the outlier distributions form clusters as defined in (1), which means that the divergence within the same cluster is always less than the divergence between different clusters.

The following theorem shows that under condition (1), the one-step test δ_c3^(1) proposed in Algorithm 3 is universally exponentially consistent, and has time complexity linear in M.

Theorem 4. For each M ≥ 3, when both the outlier distributions {μ_i}_{i∈S} and the typical distributions {π_j}_{j∉S} form clusters, i.e., condition (1) holds, the test δ_c3^(1), which runs one step of Algorithm 3, is universally exponentially consistent, and has time complexity O(M).

Proof sketch. The exponential consistency of δ_c3^(1) can be established using similar techniques as in Theorem 3. The details of the proof can be found in Appendix E.

The GL approach of replacing the true distributions in (2) by their MLEs leads to identical likelihood estimates for each hypothesis. Thus, the GL approach is not applicable here. One could apply the test in (6) to this problem, but it can be shown (see Appendix F) that the test in (6) is not universally exponentially consistent, even if condition (1) holds.

6 Numerical Results

In this section, we compare the performance of the GL test δ_GL, the clustering based tests δ_c2, δ_c3 and the one-step tests δ_c2^(1), δ_c3^(1). We consider the case with identical typical distributions.
We set π to be the uniform distribution with alphabet size 10, and generate the outlier distributions randomly.

We first simulate the case with distinct outliers where T is known. We choose M = 20 sequences and T = 3 outliers. In Fig. 1, we plot log P_e as a function of n for δ_GL, δ_c2 and δ_c2^(1). As we can see from Fig. 1, δ_c2^(1) and δ_c2 are both exponentially consistent, and δ_c2 outperforms δ_c2^(1), as shown in Theorem 2. Comparison between δ_c2 and δ_GL shows that, without sacrificing much in performance, δ_c2 is about 50 times faster than δ_GL.

We then simulate the case with an unknown number of identical outliers. We set M = 100 sequences and T = 10 outliers. Fig. 2 shows that δ_c3 outperforms δ_c3^(1). We note that running the clustering based tests 5000 times takes 5 minutes on a 3.6 GHz i7 CPU. However, the GL test is not feasible here, since the number of hypotheses it needs to search over is exponential in M.

Figure 1: Comparison of the tests δ_c2^(1), δ_c2, δ_GL, known number of distinct outliers

Figure 2: Comparison of the tests δ_c3, δ_c3^(1), unknown number of identical outliers

Appendix

A Useful Lemmas

Our proofs rely on the following lemmas provided in [13].

Lemma 1. [13, Lemma 1] Let Y^(1), ..., Y^(J) be mutually independent random vectors with each Y^(j), j = 1, ..., J, being n i.i.d. repetitions of a random variable distributed according to p_j ∈ P(Y). Let A_n be the set of all J-tuples (y^(1), ..., y^(J)) ∈ Y^{Jn} whose empirical distributions (γ_1, ..., γ_J) lie in a closed set E ⊂ P(Y)^J. Then, it holds that

    lim_{n→∞} −(1/n) log P{(Y^(1), ..., Y^(J)) ∈ A_n} = min_{(q_1,...,q_J)∈E} Σ_{j=1}^{J} D(q_j ‖ p_j).     (10)

Lemma 2. [13, Lemma 2] For any two pmfs p_1, p_2 ∈ P(Y) with full supports, it holds that

    2B(p_1, p_2) = min_{q∈P(Y)} ( D(q‖p_1) + D(q‖p_2) ).     (11)

In particular, the minimum on the right side is achieved by

    q*(y) = p_1^{1/2}(y) p_2^{1/2}(y) / Σ_{y'∈Y} p_1^{1/2}(y') p_2^{1/2}(y'),  y ∈ Y.     (12)

B Proof of Theorem 1

Due to the structure of the test, we know that errors occur at two different steps:

1. Initialization Step: the constructed clustering center for the typical sequences, π̂, is actually generated from an outlier distribution.

2. Assignment Step: given the correct clustering center π̂, the empirical distribution of an outlying sequence is closer to the clustering center π̂ than that of a typical sequence.

We first use the event E to denote that errors occur at the initialization step. Since we use γ* as the clustering center π̂, where D(γ*‖γ^(0)) is the ⌈M/2⌉-th element among all D(γ_i‖γ^(0)), it is difficult to write the explicit form of the event E. However, we can upper bound the probability of E in the following two scenarios.

If γ^(0) is generated from the typical distribution π, an error occurs when π̂ is actually generated from an outlier distribution; then D(γ_i‖γ^(0)) ≤ D(γ_j‖γ^(0)) must hold for some i ∈ S, j ∉ S. Due to the arbitrariness of γ^(0), the probability of this error event can be upper bounded by the probability of the event

    E_1 ≜ {D(γ_i‖γ_{j1}) ≤ D(γ_{j2}‖γ_{j1}), ∃ i ∈ S, ∃ j_1, j_2 ∉ S}.     (13)

If γ^(0) is generated from an outlier distribution, the error probability can be upper bounded by the probability of the event

    E_2 ≜ {D(γ_{j1}‖γ_{i1}) < D(γ_{i2}‖γ_{i1}) < D(γ_{j2}‖γ_{i1}), ∃ i_1, i_2 ∈ S, ∃ j_1, j_2 ∉ S}.     (14)

Thus, P(E) ≤ P(E_1) + P(E_2).
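Lemma 2 is easy to check numerically: with q* taken as the normalized geometric mean in (12), the sum D(q*‖p_1) + D(q*‖p_2) matches 2B(p_1, p_2), and randomly drawn pmfs do no better. This is a quick sanity check of the lemma, not part of the paper.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def bhattacharyya(p, q):
    """B(p, q) = -log sum_y p(y)^{1/2} q(y)^{1/2}."""
    return float(-np.log(np.sum(np.sqrt(p * q))))

rng = np.random.default_rng(7)
p1 = rng.dirichlet(np.ones(6))
p2 = rng.dirichlet(np.ones(6))

# Minimizer (12): the normalized geometric mean of p1 and p2.
q_star = np.sqrt(p1 * p2) / np.sum(np.sqrt(p1 * p2))

lhs = 2 * bhattacharyya(p1, p2)
rhs = kl(q_star, p1) + kl(q_star, p2)
assert abs(lhs - rhs) < 1e-10            # equality in (11) at q = q*

for _ in range(100):                     # q_star is indeed a minimizer
    q = rng.dirichlet(np.ones(6))
    assert kl(q, p1) + kl(q, p2) >= rhs - 1e-12
```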
