ebook img

Separating populations with wide data: A spectral analysis PDF

0.47 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Separating populations with wide data: A spectral analysis

Electronic Journal of Statistics Vol.3(2009)76–113 ISSN:1935-7524 DOI:10.1214/08-EJS289 Separating populations with wide data: A spectral analysis Avrim Blum 9 ∗ 0 Computer ScieneDepartement 0 Carnegie Mellon University 2 5000 Forbes Ave Pittsburgh, PA 15213, USA n e-mail: [email protected] a J Amin Coja-Oghlan 9 † 2 Universityof Edinburgh Informatics Forum ] 10 Crichton Street L Edinburgh EH8 9AB, UK M e-mail: [email protected] . t Alan Frieze a ‡ t s Departement of Mathmatics [ Carnegie Mellon University 5000 Forbes Ave 2 Pittsburgh, PA 15213, USA v e-mail: [email protected] 4 3 Shuheng Zhou 4 § 3 Seminar fu¨r Statistik . ETH Zentrum, 6 CH-8092 Zu¨rich, Switzerland 0 e-mail: [email protected] 7 0 : v i X Abstract: Inthispaper, weconsider theproblem ofpartitioning asmall r datasampledrawnfromamixtureofkproductdistributions.Weareinter- a estedinthecasethatindividualfeaturesareoflowaverage qualityγ,and wewanttouseasfewofthemaspossibletocorrectlypartitionthesample. Weanalyzeaspectraltechniquethatisabletoapproximatelyoptimizethe total data size—the product of number of data points n and the number offeatures K—neededtocorrectlyperformthispartitioningasafunction of 1/γ for K > n. Our goal is motivated by an application in clustering individualsaccordingtotheirpopulationoforiginusingmarkers,whenthe divergencebetween anytwoofthepopulations issmall. ∗SupportedinpartbytheNSFundergrantCCF-0514922 †SupportedbytheGermanResearchFoundation undergrantCO646 ‡SupportedinpartbytheNSFundergrantCCF-0502793 §SupportedinpartbytheNSFundergrantsCCF-0625879andCNF–0435382 76 A. Blum et al./Separating populations with wide data: A spectral analysis 77 AMS 2000 subject classifications: Primary60K35, 60K35; secondary 60K35. Keywords and phrases: mixture of product distributions, clustering, smallsample,spectralanalysis. ReceivedSeptember 2008. Contents 1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 1.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 2 A simple algorithm using singular vectors . . . . . . . . . . . . . . . . 82 2.1 Generating a large gap in s ( ),s ( ) . . . . . . . . . . . . . . . 85 1 2 X X 2.2 Detailed analysis for the simple algorithm . . . . . . . . . . . . . 86 2.3 Proof of Theorem 2.4. . . . . . . . . . . . . . . . . . . . . . . . . 87 2.4 Correctness of classification for the simple algorithm . . . . . . . 88 3 The algorithm Partition . . . . . . . . . . . . . . . . . . . . . . . . . 91 3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 3.2 Description of the algorithm . . . . . . . . . . . . . . . . . . . . . 93 3.3 Proof of Lemma 3.5 . . . . . . . . . . . . . . . . . . . . . . . . . 95 3.4 Proof of Lemma 3.6 . . . . . . . . . . . . . . . . . . . . . . . . . 98 4 Experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 A More Proofs for the simple algorithm classify . . . . . . . . . . . . . . 101 A.1 Proof of Lemma 2.1 . . . . . . . . . . . . . . . . . . . . . . . . . 101 A.2 Some Propositions regarding the static matrices . . . . . . . . . . 104 A.3 Proofs of Proposition 2.5 and 2.6 . . . . . . . . . . . . . . . . . . 106 A.4 Proof of Lemma 2.7 . . . . . . . . . . . . . . . . . . . . . . . . . 108 B Proof of Lemma 2.16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 1. Introduction We explore atype ofclassificationproblemthatarisesin thecontextofcompu- tationalbiology.Theproblemis thatwearegivenasmallsample ofsizen,e.g., DNAofnindividuals(think ofninthe hundredsorthousands),eachdescribed by the values of K features or markers, e.g., SNPs (Single Nucleotide Polymor- phisms, think of K as an order of magnitude larger than n). Our goal is to use these featurestoclassify the individualsaccordingto their populationoforigin. Featureshaveslightlydifferentprobabilitiesdependingonwhichpopulationthe individualbelongsto,andareassumedtobeindependentofeachother(i.e.,our data is a small sample from a mixture of k very similar product distributions). The objective we consider is to minimize the total data size D = nK needed to correctly classify the individuals in the sample as a function of the “average quality” γ of the features, under the assumption that K > n. Throughout the paper, we use pj and µj as shorthands for p(j) and µ(j) respectively. i i i i A. Blum et al./Separating populations with wide data: A spectral analysis 78 Statistical Model: We have k probability spaces Ω ,...,Ω over the set 1 k 0,1 K. Further, the components (features) of z Ω are independent and t { } ∈ Pr [z =1] = pi (1 t k, 1 i K). Hence, the probability spaces Ωt i t ≤ ≤ ≤ ≤ Ω ,...,Ω comprise the distribution of the features for each of the k popula- 1 k tions. Moreover, the input of the algorithm consists of a collection (mixture) of n = k N unlabeled samples, N points from Ω , and the algorithm is t=1 t t t to determine for each data point from which of Ω ,...,Ω it was chosen. In 1 k P generalwedonot assumethatN ,...,N arerevealedtothe algorithm;butwe 1 t do require some bounds on their relative sizes. An important parameter of the probability ensemble Ω ,...,Ω is the measure of divergence 1 k K (pi pi)2 γ = min i=1 s− t (1.1) 1 s<t k K ≤ ≤ P betweenanytwodistributions.Notethat√Kγ measurestheEuclideandistance betweenthemeansofanytwodistributionsandthusrepresentstheirseparation. Further, let N = n/k (so if the populations were balanced we would have N of each type) and assume from now on that kN <K. Let D = nK denote the size ofthe data-set.Inaddition,let σ2 =max pi(1 pi)denote the maximum i,t t − t variance of any random bit. The biological context for this problem is we are given DNA information from n individuals from k populations of origin and we wish to classify each individual into the correct category. DNA contains a series of markers called SNPs, each of which has two variants (alleles). Given the population of origin of an individual, the genotypes can be reasonably assumed to be generated by drawing alleles independently from the appropriate distribution. The following theorem gives a sufficient condition for a balanced (N = N ) input instance 1 2 when k=2. Theorem 1.1 (Zhou (2006)). Assume N = N = N. If K = Ω lnN and 1 2 γ KN =Ω lnNloglogN then with probability 1 1/poly(N), among all balanced γ2 − (cid:0) (cid:1) cuts in the complete graph formed among 2N sample individuals, the maximum (cid:0) (cid:1) weight cut corresponds to the partition of the 2N individuals according to their population of origin. Here the weight of a cut is the sum of weights across all edges in the cut, and the edge weight equals the Hamming distance between the bit vectors of the two endpoints. Variants of the above theorem, based on a model that allows two random draws from each SNP for an individual, are given in Chaudhuri et al. (2007); Zhou(2006).Inparticular,noticethatedgeweightsbasedonthe inner-product of two individuals’ bit vectors correspond to the sample covariance, in which case the max-cut corresponds to the correct partition Zhou (2006) with high probability.Findingamax-cutiscomputationallyintractable;henceinthesame paper Chaudhuri et al. (2007), a hill-climbing algorithm is given to find the correct partition for balanced input instances but with a stronger requirement on the sizes of both K and nK. A. Blum et al./Separating populations with wide data: A spectral analysis 79 A Spectral Approach: In this paper, we construct two simpler algorithms using spectral techniques, attempting to reproduce conditions above.In partic- ular,we study the requirements onthe parametersof the model (namely, γ,N, k,andK)thatallowustoclassifyeveryindividualcorrectlyandefficientlywith high probability. ThetwoalgorithmsClassifyandPartitioncompareasfollows.Bothalgo- rithmsarebasedonspectralmethodsoriginallydevelopedingraphpartitioning. More precisely, Theorem 1.2 is based on computing the singular vectors with the two largestsingularvalues for eachof the n K input randommatrix.The × procedure is conceptually simple, easy to implement, and efficient in practice. Forsimplicity,ProcedureClassifyassumestheseparationparameterγ isknown to decide which singular vector to examine; in practice, one can just try both singular vectors as we do in the simulations. Proof techniques for Theorem 1.2, however,are difficult to apply to cases of multiple populations, i.e., k >2. Pro- cedure Partition is based on computing a rank-k approximation of the input random matrix and can cope with a mixture of a constant number of popula- tions. It is more intricate for both implementation and execution than Classify. It does not require γ as an input, while only requires that the constant k is given. We prove the following theorems. Theorem 1.2. Let ω = min(N1,N2) and ω be a lower bound on ω. Let γ be n min given. Assume that K > 2nlnn and k = 2. Procedure Classify allows us to separate two populations w.h.p., when n Ω σ2 , where σ2 is the largest ≥ γωminω variance of any random bit, i.e. σ2 =max pi(1 pi). Thus if the populations i,t (cid:0)t − t(cid:1) are roughly balanced, then n c suffices for some constant c. ≥ γ This implies that the data required is D = nK = O lnnσ4/γ2ω2ω2 . Let min P =(pi) , we have s s i=1,...,K (cid:0) (cid:1) K σ P P = Kγ = (pi pi)2 √lnn. (1.2) k 1− 2k2 v 1− 2 ≥ ω ω ui=1 min p uX t Theorem 1.3. Let ω = min(N1,...,Nk). There is a polynomial time algorithm n Partition that satisfies the following. Suppose that K > nlogn, γ > K 2, − n > Cγkωσ2 for some large enough constant Ck, and ω = Ω(1). Then given the empirical n K matrix comprising the K features for each of the n individuals × along with the parameter k, Partition separates the k populations correctly w.h.p. Summary and Future Direction: Note that unlike Theorem 1.1, both The- orem 1.2 and Theorem 1.3 require a lower bound on n, even when k = 2 and the input instance is balanced. We illustrate through simulations to show that this seems not to be a fundamental constraint of the spectral techniques; our experimental results show that even when n is small, by increasing K so that nK =Ω(1/γ2),onecanclassifyamixtureoftwopopulationsusingideasinPro- cedureClassifywithsuccessratereachingan“oracle”curve,whichiscomputed A. Blum et al./Separating populations with wide data: A spectral analysis 80 assuming that distributions are known, where success rate means the ratio be- tween correctly classified individuals and N. Exploring the tradeoffs of n and K that are sufficient for classification, when sample size n is small, is both of theoretical interests and practical value. Outline of the paper: The paper is organized as follows. In Section 1.1 we discussrelatedwork.Then,inSection2wedescribethealgorithmClassifyfor Theorem1.2andoutlineitsanalysis.Some(very)technicaldetailsoftheanaly- sis are deferred to the appendix. Section 3 deals with the algorithmPartition for Theorem 1.3. Finally, in Section 4 we report some experimental results on Classify. 1.1. Related work In their seminal paper Pritchardet al. (2000), Pritchard, Stephens, and Don- nelly presented a model-based clustering method to separate populations using genotype data. They assume that observations from each cluster are random from some parametric model. Inference for the parameters corresponding to each population is done jointly with inference for the cluster membership of each individual, and k in the mixture, using Bayesian methods. The idea of exploiting the eigenvectors with the first two eigenvalues of the adjacency matrix to partition graphs goes back to the work of Fiedler Fiedler (1973), and has been used in the heuristics for various NP-hard graph parti- tioning problems (e.g., Fjallstrom (1998)). The main difference between graph partitioning problems and the classification problem that we study is that the matrices occurring in graph partitioning are symmetric and hence diagonaliz- able,while ourinput matrixisrectangularingeneral.Thus,the contributionof Theorem1.2istoshowthataconceptuallysimpleandefficientalgorithmbased onsingularvaluedecompositionsperformswellintheframeworkofafairlygen- eralprobabilistic model, where probabilities for eachof the K features for each of the k populations are allowed to vary. Indeed, the analysis of Classify re- quiresexploringnewideassuchastheSeparationLemmaandthenormalization of the random matrix X, for generating a large gap between top two singular valuesoftheexpectationmatrix andforboundingtheanglebetweenrandom X singular vectors and their static correspondents, details of which are included in Section 2 with analysis in full version. Procedure Partition and its analysis build upon the spectral techniques of McSherry (2001) on graph partitioning, and an extension due to Coja-Oghlan (2006). McSherry provides a comprehensive probabilistic model and presents a spectralalgorithmfor solving the partitioning problemon randomgraphs,pro- vided that a separationcondition similar to (1.2) is satisfied. Indeed, McSherry (2001)encompassesaconsiderableportionofthepriorworkonGraphColoring, Minimum Bisection, and finding Maximum Clique. Moreover, McSherry’s ap- proach easily yields an algorithm that solves the classification problem studied in the present paper under similar assumptions as in Theorem 1.3, provided A. Blum et al./Separating populations with wide data: A spectral analysis 81 that the algorithm is given the parameter γ as an additional input; this is ac- tually pointed out in the conclusions of McSherry (2001). In the context of graph partitioning, an algorithm that does not need the separation parameter as an input was devised in Coja-Oghlan (2006). The main difference between Partition and the algorithm presented in Coja-Oghlan (2006) is that Parti- tion deals with the asymmetric n K matrix of individuals/features, whereas × Coja-Oghlan(2006) deals with graph partitioning (i.e., a symmetric matrix). There are two streams of related work in the learning community. The first stream is the recent progress in learning from the point of view of cluster- ing: given samples drawn from a mixture of well-separated Gaussians (compo- nent distributions), one aims to classify each sample according to which com- ponent distribution it comes from, as studied in Dasgupta (1999), Dasgupta and Schulman (2000), Arora and Kannan (2001); Vempala and Wang (2002); Achlioptas and McSherry(2005); Kannan et al.(2005);Dasgupta et al.(2005). This framework has been extended to more general distributions such as log- concavedistributions in Achlioptas and McSherry(2005); Kannan et al.(2005) andheavy-taileddistributionsinDasgupta et al.(2005),aswellastomorethan twopopulations.Theseresultsfocusmainlyonreducingtherequirementonthe separations between any two centers P and P . In contrast, we focus on the 1 2 sample size D. This is motivated by previous results Chaudhuri et al. (2007); Zhou (2006) stating that by acquiring enough attributes along the same set of dimensions from each component distribution, with high probability, we can correctly classify every individual. Whileouraimisdifferentfromthoseresults,wheren>K isalmostuniversal and we focus on cases K > n, we do have one common axis for comparison, the ℓ -distance between any two centers of the distributions. In earlier works 2 Dasgupta and Schulman (2000); Arora and Kannan (2001), the separation re- quirementdepended onthe number of dimensions ofeachdistribution; this has recently been reduced to be independent of K, the dimensionality of the dis- tribution for certain classes of distributions Achlioptas and McSherry (2005); Kannan et al. (2005). This is comparable to our requirement in (1.2) for the discrete distributions. For example, according to Theorem 7 in Achlioptas and McSherry (2005), in order to separate the mixture of two Gaussians, σ P P =Ω +σ logn (1.3) k 1− 2k2 √ω (cid:18) (cid:19) p is required. Besides Gaussian and Logconcave, a general theorem: Theorem 6 in Achlioptas and McSherry (2005) is derived that in principle also applies to mixtures of discrete distributions. The key difficulty of applying their theo- rem directly to our scenario is that it relies on a concentration property of the distribution (Eq. (10) of Achlioptas and McSherry (2005)) that need not hold in our case. In addition, once the distance between any two centers is fixed (i.e., once γ is fixed in the discrete distribution), the sample size n in their al- gorithms is always larger than Ω K log5K Achlioptas and McSherry (2005); ω Kannan et al. (2005) for log-concave distributions (in fact, in Theorem 3 of (cid:0) (cid:1) Kannan et al. (2005), they discard at least this many individuals in order to A. Blum et al./Separating populations with wide data: A spectral analysis 82 correctly classify the rest in the sample), and larger than Ω(K) for Gaussians ω Achlioptas and McSherry (2005), whereas in our case, n < K always holds. Hence, our analysis allows one to obtain a clean bound on n in the discrete case. The second stream of work is under the PAC-learning framework, where givena samplegeneratedfromsometargetdistributionZ,the goalisto output a distribution Z that is close to Z in Kullback-Leibler divergence:KL(Z Z ), 1 1 || whereZ isamixtureofproductdistributionsoverdiscretedomainsorGaussians Kearns et al. (1994); Freund and Mansour (1999); Cryan (1999); Cryan et al. (2002); Mossel and Roch (2005); Feldman et al. (2005, 2006). They do not re- quireaminimaldistance betweenanytwodistributions,buttheydonotaimto classify every sample point correctly either, and in general require much more data. Our workis alsorelatedto the use ofprincipal component analysis (“PCA”) in genetics Patterson et al. (2006); Price et al. (2006). The basic approach in these papers is to use the eigenvectors of a covariance matrix between samples to analyze a mixture of populations. While Pattersonet al. (2006); Price et al. (2006) study the use of spectral methods empirically, the crucial point of the present work is that we prove rigorously that spectral methods succeed on a certain (simple) probabilistic model. Hence, our work can be seen as a further theoretical justification of the practical use of PCA. A difference between the presentpaperandPatterson et al.(2006);Price et al.(2006)isthatweactually aim to assign each individual to exactly one of the populations. By contrast, Pattersonet al. (2006); Price et al. (2006) just assign each individual a real “weight” for each population: essentially the eigenvectors with the dominant eigenvalues corresponding to the populations, and each individual is assigned itsprojectiononthesedominanteigenvectors.ThealgorithmClassifyissome- whatsimilartoPCA,butPartitionisconceptuallymoreinvolved.Inaddition, our experimental results show a phase transition phenomenon similar to what was observed in Patterson et al. (2006) in detecting population structure using simulated data. 2. A simple algorithm using singular vectors As described in Theorem 1.2, we assume we have a mixture of two product distributions. Let N ,N be the number of individuals from each population 1 2 class. Our goal is to correctly classify all individuals according to their distri- butions. Let n = 2N = N +N , and refer to the case when N = N as the 1 2 1 2 balanced input case. For convenience, let us redefine “K” to assume we have O(logn) blocks of K features each (so the total number of features is really O(Klogn)) and we assume that each set of K features has divergence at least γ. (If we perform this partitioning of features into blocks randomly, then with highprobability this divergencehas changedby only a constantfactor for most blocks.) The high-levelideaofthe algorithmisnowtorepeatthe followingprocedure foreachblockofKfeatures:usetheKfeaturestocreateann KmatrixX,such × A. Blum et al./Separating populations with wide data: A spectral analysis 83 that each row X ,i = 1,...,n, corresponds to a feature vector for one sample i point, across its K dimensions. We then compute the top two left singular vectors u ,u of X and use these to classify each sample. This classification 1 2 induces some probability of error f for each individual at each round, so we repeat the procedure for each of the O(logn) blocks and then take majority vote over different runs. Each round we require K n features, so we need ≥ O(nlogn) features total in the end. In more detail, we repeat the following procedure O(logn) times. Let T = 15N√3ω γ,whereω isthelowerboundontheminimumweightmin N1, N2 , 32 min min {2N 2N} whichisindependentofanactualinstance.Lets (X),s (X)bethetoptwosin- 1 2 gular values of X. Procedure Classify: Given γ,N,ω . Assume that N 1, min ≫ γ Normalization:usetheK featurestoformarandomn K matrixX;Each • × individualrandomvariableX isanormalized randomvariablebasedon i,j the original Bernoulli r.v. b 0,1 with Pr[b =1]= pj for X P i,j ∈ { } i,j 1 i ∈ 1 and Pr[b =1]=pj for X P , such that X = b+1. i,j 2 i ∈ 2 i,j 2 Taketoptwoleftsingularvectorsu ,u ofX,whereu =[u ,...,u ],i= 1 2 i i,1 i,n • 1,2. 1. If s (X) > T = 15N√3ω γ, use u to partition the individuals 2 32 min 2 with 0 as the threshold, i.e., partition j [n] according to u < 0 2,j ∈ or u 0. 2,j ≥ 2. Otherwise, use u to partition, with mixture mean M = n u 1 i=1 1,n as the threshold. P Analysis of the Simple Algorithm: Our analysis is based on comparing entries in the top two singular vectors of the normalized random n K ma- × trix X, with those of a static matrix , where each entry = E[X ] is i,j i,j X X the expected value of the corresponding entry in X. Hence i = 1,...,N , 1 = [µ1,µ2,...,µK], where µj = 1+pj1, j, and i = N +∀1,...,n, = Xi 1 1 1 1 2 ∀ ∀ 1 Xi [µ1,µ2,...,µK], where µj = 1+pj2, j. We assume the divergence is exactly γ 2 2 2 2 2 ∀ among the K features that we have chosen in all calculations. TheinspirationforthisapproachisbasedonLemma2.1,whoseproofisbuilt upon Theorem A.2 that is presented in a lecture note by Spielman (2002). For ann K matrixA,lets (A) s (A) s (A)besingularvaluesofA.Let 1 2 n × ≥ ≥···≥ u ,...,u ,v ,...,v ,bethenleftandrightsingularvectorsofX,corresponding 1 n 1 n to s (X),...,s (X) such that u = 1, v = 1, i. We denote the set of n 1 n k ik2 k ik2 ∀ left and right singular vectors of with u¯ ,...,u¯ ,v¯ ,...,v¯ . 1 n 1 n X Lemma 2.1. Let X be the random n K matrix and its expected value × X matrix. Let A = X be the zero-mean random matrix. Let θ be the angle i −X between two vectors: [u ,v ], [u¯ ,v¯], where [u ,v ] = [u¯ ,v¯] =2 and [u,v] i i i i k i i k2 k i i k2 A. Blum et al./Separating populations with wide data: A spectral analysis 84 represents a vector that is the concatenation of two vectors u,v. 4s (A) 1 u u¯ [u ,v ] [u¯ ,v¯] 2θ 2sin(θ ) , (2.1) k i− ik2 ≤k i i − i i k2 ≈ i ≈ i ≤ gap(i, ) X where gap(i, )=min s ( ) s ( ). j=i i j X 6 | X − X | We first bound the largest singular value s (A) = s (X ) of (a ) with 1 1 i,j −X independent zero-mean entries, which defines the Euclidean operator norm (a ) :=sup a x y : x2 1, y2 1 . (2.2) k i,j k i,j i j i ≤ i ≤ (cid:26) i,j (cid:27) X X X The behaviorof the largestsingular value ofan n m randommatrices A with × i.i.d. entries is well studied. Latala (2005) shows that the weakest assumption foritsregularbehaviorisboundednessofthefourthmomentoftheentries,even if they arenot identically distributed.Combining Theorem2.2ofLatala(2005) with the concentrationTheorem2.3by Meckes(2004)provesTheorem2.4 that we need 1. Theorem 2.2 (NormofRandomMatricesLatala(2005)). For any finiten m × matrix A of independent mean zero r.v.’s a we have, for an absolute constant i,j C, 1 4 E (a ) C max Ea2 +max Ea2 + Ea4 . (2.3) k i,j k≤  i i,j j i,j i,j  s j s i (cid:18)i,j (cid:19) X X X   Theorem 2.3 (Concentration of Largest Singular Value: Bounded Range Meckes (2004)). For any finite n m, where n m, matrix A, such × ≤ that entries a are independent r.v. supported in an interval of length at most i,j D, then, for all t, Pr[s (A) Ms (A) t] 4e t2/4D2. (2.4) 1 1 − | − |≥ ≤ Theorem 2.4 (Largest Singular Value of a Mean-zero Random Matrix). For any finite n K,where n K, matrix A, such that entries a are independent i,j × ≤ mean zeror.v. supportedin an interval oflength atmost D, withfourthmoment upper bounded by B, then Pr s (A) CB1/4√K+4D√π+t 4e t2/4 (2.5) 1 − ≥ ≤ h i for all t. Hence A C B1/4√K for an absolute constant C . 1 1 k k≤ 1 OnecanalsoobtainanupperboundofO(√n+K)ons1(A)usingatheoremonbyVu (2005),throughtheconstructiona(n+K) (n+K)squarematrixoutofA. × A. Blum et al./Separating populations with wide data: A spectral analysis 85 2.1. Generating a large gap in s1(X),s2(X) InordertoapplyLemma2.1tothetoptwosingularvectorsofX and through X 4s (X ) 1 u u¯ −X (2.6) k 1− 1k2 ≤ s ( ) s ( ) 1 2 | X − X | 4s (X ) 1 u u¯ −X , (2.7) k 2− 2k2 ≤ min(s ( ) s ( ) , s ( )) 1 2 2 | X − X | | X | we need to first bound s ( ) s ( ) away from zero, since otherwise, RHSs 1 2 | X − X | on both (2.6) and (2.7) become unbounded. We then analyze gap(2, )=min(s ( ) s ( ) , s ( )). 1 2 2 X | X − X | | X | Let us first define values a,b,c that we use throughout the rest of the paper: K K K a= (µk)2, b= µkµk, c= (µk)2. (2.8) 1 1 2 2 k=1 k=1 k=1 X X X For the following analysis, we can assume that a,b,c [K/4,K], given that X ∈ is normalized in Procedure Classify. We first show that normalization of X as described in Procedure Classify guaranteesthat not only s ( ) s ( ) =0, but there also exists a Θ(√NK) 1 2 | X − X |6 amount of gap between s ( ) and s ( ) in Proposition 2.5: 1 2 X X gap( ):= s ( ) s ( ) =Θ(√NK). (2.9) 1 2 X | X − X | Proposition2.5. ForanormalizedrandommatrixX,itsexpectedvaluematrix X satisfies 4c0√52NK ≤ gap(X) ≤ √2NK, where c0 = K|b(|a√+acc) is a constant, given that a,b,c [K/4,K] as defined in (2.8). In addition, ∈ KN NK s ( ) √2NK, and s ( )+s ( ) √2NK. (2.10) 1 1 2 4 ≤ X ≤ 2 ≤ X X ≤ r r We next state a few important results that justify Procedure Classify. Note that the left singular vectors u¯ , i of are of the form [x ,...,x ,y ,...,y ]T: i i i i i ∀ X u¯ =[x ,...,x ,y ,...,y ]T, and u¯ =[x ,...,x ,y ,...,y ]T, (2.11) 1 1 1 1 1 2 2 2 2 2 wherex repeatsN timesandy repeatsN times.WefirstshowProposition2.6 i 1 i 2 regarding signs of x ,y ,i=1,2, followed by a lemma bounding the separation i i of x ,y . We then state the key Separation Lemma that allows us to conclude 2 2 that least one of top two left singular vectors of X can be used to classify data at each round. It can be extended to cases when k >2. Proposition 2.6. Let b as defined in (2.8): when b > 0, entries x ,y in u¯ 1 1 1 have the same sign while x ,y in u¯ have opposite signs. 2 2 2

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.