ebook img

Feature Screening in Large Scale Cluster Analysis PDF

0.94 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Feature Screening in Large Scale Cluster Analysis

Feature Screening in Large Scale Cluster Analysis TRAMBAK BANERJEE, GOURAB MUKHERJEE AND PETER RADCHENKO University of Southern California 7 1 0 2 n a Abstract J 1 Weproposeanovelmethodologyforfeaturescreeninginclusteringmassivedatasets, 1 in which both the number of features and the number of observations can potentially ] E be very large. Taking advantage of a fusion penalization based convex clustering M criterion, we propose a very fast screening procedure that efficiently discards non- . t a informative features by first computing a clustering score corresponding to the clus- t s [ teringtreeconstructedforeachfeature,andthenthresholdingtheresultingvalues. We 1 provide theoretical support for our approach by establishing uniform non-asymptotic v 7 bounds on the clustering scores of the “noise” features. These bounds imply perfect 5 8 screeningofnon-informativefeatureswithhighprobabilityandarederivedviacareful 2 analysis of theempirical processes corresponding to the clustering trees that are con- 0 . 1 structed for each of the features by the associated clustering procedure. Through ex- 0 7 tensivesimulationexperimentswecomparetheperformanceofourproposedmethod 1 : with other screening approaches, popularly used in cluster analysis, and obtain en- v i couragingresults. Wedemonstrateempiricallythatourmethodisapplicabletocluster X r analysisofbigdatasetsarisinginsingle-cellgeneexpressionstudies. a Some key words: Convex clustering; Non-asymptotic screening rate; Modality detection; High- dimensionality;EmpiricalProcesses;Single-cellbiology;RNA-Seqdata 1 Introduction We consider the problem of feature screening in large scale cluster analysis. Clustering is one of the most popular unsupervised classification techniques and is widely used in a 1 myriadofstatisticalapplicationsforstratificationandsub-populationidentification(Harti- gan & Wong, 1979; Farcomeni & Greco, 2016; Rousseeuw & Kaufman, 1990). In recent years, due to massive advancements in the modern data collection and assimilation tech- niques, very big datasets, with both a large number of observations and a large number of features,havebeengeneratedwithincreasingfrequency. Classicalclusteringmethods(see, for example, Chapter 14 of Friedman et al., 2001) are either computationally challenging or ineffective for conducting segmentation analysis of such massive modern data. In many scientificapplications,whenthedimensionofthedataisveryhigh,mostofthecoordinates (i.e. features) contain very little information regarding the grouping structure. Classical clustering methods, which do not reduce the dimension of the data, suffer, because the agglomerative effects of the large number of “noise” features conceal important clustering information available in a relatively smaller number of “signal” features. Recently devel- oped clustering algorithms, such as Witten & Tibshirani (2012); Arias-Castro & Verzelen (2014); Chan & Hall (2012); Jin & Wang (2014), which exploit the underlying sparse- ness, are effective in dealing with high-dimensional data. Herein we propose a scalable computationally efficient approach, entitled COSCI (COnvex Screening for Cluster Infor- mation), that can efficiently weed out the features that are non-informative for clustering. As a non-parametric approach, COSCI has competitive advantages over the popular Gaus- sian mixture based parametric techniques (Arias-Castro & Verzelen, 2014; Azizyan et al., 2013). Unlike the non-parametric density estimation based screening techniques, COSCI is very scalable, and can successfully handle datasets with more than one million obser- vations. Our proposed procedure discards non-informative features by first computing a clustering score for the clustering tree constructed for each feature, and then thresholding theresultingvalues. Weprovidethetheoreticalmotivationforourapproachbyestablishing uniformnon-asymptoticboundsontheclusteringscoresofthenoisefeatures. OurtheoreticalresultsareasignificantextensionoftheunivariateresultsinRadchenko &Mukherjee(2014). Herein,werelyonamorecarefulanalysisoftheempiricalprocesses corresponding to the clustering trees constructed for each feature by the associated clus- tering procedure. We derive a stronger tail probability bound for the clustering score of eachfeature,whichthenallowsustoestablishauniformboundforalltheclusteringscores of the non-informative features in the high-dimensional setting, where the number of such features can be extremely large. Using this uniform bound, we infer that under mild regu- 2 larityconditionsonthepopulationdistributiontheproposedCOSCIalgorithmwilldiscard the non-informative features and perfectly select the informative features with very high probability. High dimensional datasets arising in modern biology, econometrics, engineering, text mining and signal processing (James et al., 2013) require a significant degree of feature screening for subsequent application of a clustering algorithm. We illustrate the applica- bility of COSCI for making scientific discoveries through the analysis of single cell bio- logical data. Emerging technologies, such as single-cell Mass cytometry (Bendall et al., 2011), next-generation sequencing (Liu et al., 2012) and micro-fluidic methods (Dalerba et al., 2011; White et al., 2011), have recently enabled us to collect gene and protein ex- pressioninformationforeachcell(Wang&Bodovitz,2010). Theresultingdatasets,which are not only very high-dimensional but also contain a large number of cellular observa- tions, serve as invaluable resources for the characterization of the cellular hierarchy in multi-cellular organisms. Often, the biological question associated with these datasets is the identification of homogeneous cellular sub-populations based on the differential ex- pression patterns of the genes. The composition of these sub-populations is subsequently analyzedtodetectinterestingstructures. Recently,severalalgorithmssuchasSPADE(Qiu et al., 2011), viSNE (Amir et al., 2013), Wanderlust (Bendall et al., 2014), Scaffold maps (Spitzer et al., 2015), SLIDE (Sen et al., 2015) and ECLAIR (Giecold et al., 2016) have been developed for conducting such sub-population analysis for single cell data. Most of these methods conduct either dimension reduction through PCA related methods or han- dle large sample sizes via down-sampling. While these algorithms are widely used, they lack appropriate mathematical guarantees to show that the resulting sub-populations are notduetorandomfluctuationsandwillbereproducibleacrossdatasetsgeneratedfromex- periments conducted under similar conditions. Furthermore, the use of techniques such as PCA to reduce dimensionality in these settings can be called into question (Chang, 1983) because (a) the inferred sub-populations may not be sparse in the expression patterns of thegenes,inwhichcasePCAhasbeentheoreticallyproventoproduceinconsistentresults (Johnstone & Lu, 2009),(b) there is no guarantee that cluster information is aligned in the direction of maximum variance. As a motivational example, consider the problem of cel- lular sub-population detection in a single-cell RNAseq data that was analyzed in Giecold et al. (2016) (henceforth referred to as G16). This massive dataset holds the expression 3 Figure 1: APPLICATION OF COSCI TO RNA-SEQ DATA OF G16. At top left we have the heatmap of the 8716 × 2730 expression matrix, where red denotes high expression and bisque stands for low expression. A priori we know that there are 19 sub-populations among the cells. The goal is to detect these sub-populations and study their composition with respect to the 33 lineage markers. The plot at bottom left shows the distribution of feature scores S . The dashed horizontal line is the screening threshold of 0.188, which is j chosenbyCOSCI.Allthefeatureswithscoresabovethethresholdareselected. Theblack dotsarethe33lineagemarkers. Theheatmapatthebottomrightshowsthecompositionof the 19 cellular sub-populations among the 33 lineage markers. The 12 lineage markers in bluearetheselectedones. Attoprightistheplotoftheerrorrate(y-axis)vs. theproportion of features selected (x-axis) for different methods. COSCI selects the fewest features and applyingk-meansorSparsek-meanstotheCOSCIselectedfeaturesreturnsasmallererror rate. levels of p = 8716 genes for n = 2730 mice bone marrow cells (Paul et al., 2015), and one of the key scientific objectives is to infer the lineage pattern of the identified sub- populations based on 33 lineage markers. If the sub-populations differ with respect to a relatively small subset of all the genes considered, then, from the statistical perspective, theproblemreducestoscreeningoutthenon-informativefeaturesfromthedataconsisting of vectors X ∈ Rn, j = 1,2,··· ,p and identifying a subset of features that retains the j 4 cluster information. As such, identifying the best subset of the genes with respect to the clusterinformationisanimportantstatisticalendeavorwithcriticalbiologicalimplications (Witten&Tibshirani,2012andthereferencestherein). Unlikethemethodsthatusedownsampling,COSCIcanaccommodatesamplesizeson the order of 106. COSCI first produces a score, S ∈ (0,0.5], for each feature, which re- j flects its relative importance for clustering, and then screens out the features with lower scores. When applied to the aforementioned RNASeq data, COSCI orders the scores of the 8716 genes and selects the top 2304 genes, which include 12 of the 33 lineage markers (see Figure 1, bottom left; the selected genes are highlighted in solid black and the black dots are the 33 lineage markers). The heatmap (Figure 1, bottom right) of the expressions forthe19sub-populationsdetectedviak-meansontheCOSCIselectedfeaturesshowsthat theinferredsub-populationsdiffersignificantlyacrossthe12selectedlineagemarkers. Af- ter applying k-means to detect the sub-populations on the 2304 genes selected by COSCI wegotamisclassificationerror(computedasCER=1−RandIndex,Rand,1971;Chipman & Tibshirani, 2006) of approximately 0.15. This error is significantly smaller than several other methods that are widely used for such clustering problems especially when the pro- portion of selected features is taken into account (see Figure 1, top right). We revisit this examplewithmoredetailsinSection4.2. 1.1 Connections to Related Statistical Literature Within the statistical literature, a number of recently proposed clustering approaches exe- cute feature screening as the first step and then rely on conventional clustering techniques, suchask-means,toclustertheremainingdata. Forexample,Chan&Hall(2012)proposed a non-parametric feature screening method that is based on coordinate-wise Excess Mass tests [(Cheng & Hall, 1998)]. They rank the features using the values of the correspond- ing test statistic. Feature selection then follows by identifying a kink in the plot of the within cluster sum of squares versus the number of identified clusters. Witten & Tibshi- rani (2012) proposed the Sparse k-means and Sparse Hierarchical clustering approaches, whichemployk-meansandhierarchicalclustering,respectively,onafeatureweighteddis- similarity matrix, where the weights are encouraged to be sparse. Their method is largely inspiredbythepopularCOSAalgorithmofFriedman&Meulman(2004)andismoreadept at sparse clustering. Recently, Arias-Castro & Pu (2016) proposed Sparse Alternate Sum 5 (SAS) clustering, which uses a hill-climbing approach to solve the sparse k-means opti- mizationproblem. Ontheparametricside,severalmodelbasedclusteringapproacheshave been introduced (Pan & Shen (2007), Wang & Zhu (2008), Xie et al. (2008)). These tech- niques typically maximize a penalized likelihood under a Gaussian mixture model, where the penalization serves the purpose of implicit feature selection. Jin & Wang (2014) and Jin et al. (2015) propose IF-PCA, which is a two step clustering method - the first step conducts coordinate wise feature selection, and the second step performs k-means cluster- ing on the matrix of left singular vectors of the selected features. The feature selection stepusestheKolmogorov-SmirnovtestforNormalitytorankthefeatures,followedbythe use of the Higher Criticism (HC) (Donoho & Jin, 2004, 2008) functional to finally select the features. Theoretical properties of clustering algorithms that combine feature selection with clustering have also been recently studied. For example, Azizyan et al. (2013) pro- videinformationtheoreticboundsonclusteringaccuracyofthehighdimensionalGaussian mixtures, while Arias-Castro & Verzelen (2014) establish minimax rates for the problems ofmixturedetectionandfeatureselectionunderthesparsityassumption. OurworkisclosertotheapproachesofWitten&Tibshirani(2012),Jin&Wang(2014) andChan&Hall(2012),wheretheobjectiveistoscreenoutthenoisefeatures. Weanalyze the problem of feature screening in large scale clustering and propose COSCI - a novel computationally efficient screening procedure with strong theoretical motivation. COSCI uses a non-parametric approach to rank order the features by their clustering leverage. In this respect, it differs from the recently proposed screening techniques, such as IF-PCA, which rely on a parametric family as a point of reference to gauge feature strength for clustering. 1.2 Organization of the Paper In Section 2 we present and discuss our screening methodology. More specifically, Algo- rithm 1 provides the details of the implementation, while Section 2.1 contains the main theoretical motivation and associated results. Section 3 provides two approaches that aid theselectionofthescreeningthreshold. InSection4weconductadetailedempiricalanal- ysisofourapproachusingbothsimulateddataandrealdatafrommicroarrayexperiments. Section 5 concludes the paper. Proofs and additional technical details are relegated to the Appendix. 6 2 Methodology and Main Results We consider the problem of clustering n observations based on p features under the prior information that a high proportion of these p features contain no clustering information. Noting that these “noise” features have unimodal marginal distributions (which may differ across the features), we develop a univariate approach, which, based on the sample obser- vations, can determine if the underlying true density is unimodal. As there are many noise features, this detection of unimodality in the marginal distributions has to be conducted with very high precision. Also, features containing relevant cluster information have to possess non unimodal marginal population distribution lest they get screened out as noise bytheaforementionedapproach. Inmostpracticalexamples,itispossibletoaccommodate thelastrequirementbyconsideringlinearcombinationsoffeatures(pairwiselinearcombi- nationsbeingthesimplestcase),whichcanthenbeexaminedformultimodality. Weexpect thatmarginallyuninformativefeaturescontainingcorrelatedsub-populationlevelinforma- tion will be revealed through multi-modality of their linear combinations (see Section 4.3 for further details). However, these linear combinations increase the inherent dimension- ality of the problem. For example, if we consider a fixed number of linear combinations for each pair of the original p features, we end up having to examine on the order of p2 new features. In this case, we leverage the scalability of our proposed marginal approach. Theorem 1, provided below, shows that the marginal screening method can be accurately usedeveninsetupswherepisexponentiallylargewithrespectton. Our methodology is formalized in Algorithm 1 below. See Appendix 6.8 for a detailed descriptionofthecomputationalstepsinvolvedinAlgortihm1. ALGORITHM 1: COSCIprocedureforfeaturescreening INPUT:datamatrixX andtuningparameterα n×p 0 FOReachj ∈ {1,··· ,p}: INITIALIZE: Sortdatainascendingorderandstorethemasx = {x ,...,x }. 1 n Setk,thenumberofclusters,equalton. Foreachiin1,··· ,n,setc = {x }. i i REPEAT: 1. ConvexMergingAlgorithm: Findconsecutiveadjacentcentroiddistances: d(r,r+1) ← (c −c )/(|c |+|c |). r+1 r r r+1 7 Findclusterswithminimummergingdistance: r∗ ← argmin d(r,r+1). r Mergeclustersr∗,r∗ +1,re-labelremainingclustersandsetk ← k −1. 2. MergeSize: Findthemergesize,αj,usingequation(2). i Findthemassaftermerge,m ,usingequation(3)and i screenthemergesize: αj = 0ifm < 0.5. i i UNTILk = 1 STORE:Clusteringscores: S = max αj j 1≤i≤n−1 i FEATURESCREENING:I(cid:98) = {j : S ≥ α } S j 0 Next,weprovideintuitiononsomeofthekeyingredientsofourscreeningalgorithm. Convex Merging Algorithm. With the goal of checking unimodality for each of the feature coordinates,weconsiderthefollowingunivariateoptimizationproblem: n (cid:88) (cid:88) min (x −c )2 +λ |c −c |. (1) i i k l c1,···,cn∈R i=1 1≤k<l≤n It is based on the observations x and corresponds to minimizing the within cluster sum i of squares under constraints on the (cid:96) distance between the cluster centroids c . Here λ 1 k is a non-negative penalty weight. The convexity of the objective function in (1) has been exploitedtodevelopalgorithmsforefficientlyproducingthepathofsolutionsasafunction of the penalty weight (Hoefling, 2010; Hocking et al., 2011; Radchenko & Mukherjee, 2014). Clustering algorithms based on fusion penalization of this type have become very popularinlargescaleclustering(Hockingetal.,2011;Radchenko&Mukherjee,2014;Zhu et al., 2014; Tan & Witten, 2015; Chi & Lange, 2015) and regression analysis (Bondell & Reich,2008;Keetal.,2013;Shen&Huang,2010;Shenetal.,2012). The solution of the objective criterion (1) can be found for all values of λ by a simple merge algorithm in O(nlogn) operations. Starting with n observations in n clusters, we sequentiallymergethenearest(intermsoftheweighteddistanceasshowninAlgorithm1) adjacentcentroidsuntilweareleftwithjustoneclusterintheend. MergeSizes. Foreachmergeonthepath,wecanexplicitlytrack(Algorithm1)theparam- eterλ,aswellasthemergesize,whichisdefinedforthei-thmergeas α = n−1min{|C |,|C |}, (2) i i1 i2 8 where|C |and|C |arethecardinalitiesofthetwocorrespondingsub-clusters. Thefunda- i1 i2 mentalworkingprincipleofourproposedCOSCIalgorithmrestsonthefollowingproperty of the merge sizes: if x ,··· ,x are indeed generated by a non-informative density, then 1 n the sample merges {α : 1 ≤ i ≤ n−1} will be uniformly small for sufficiently large n. i This property is formalized in Theorem 1 below. On the other hand, if the underlying dis- tribution contains a moderate amount of cluster information, then the merge sequence will have at least one merge that is big. The last fact is illustrated by Theorem 1 in Radchenko & Mukherjee (2014). Based on these properties of the merge sequences, we conduct the followingscreeningprocedure. Screening the Merges. Given a pre-defined threshold α , we flag the feature as potential 0 “signal”,ifthereexistsamerge,saytheith merge,suchthat α > α and m := n−1(|C |+|C |) ≥ 0.5. (3) i 0 i i1 i2 The restriction on the mass after merge, m , ensures that we only identify a merge as big i if it results in a significantly large cluster. This protects us from the risk of discovering potentialbigmergesonsmallerfragmentsofthesample,wherethenatureofthesemerges can be very fragile due to sampling fluctuations. In Algorithm 1, the after merge size control of equation (3) is easily implemented by just resetting the merge sizes to 0 if the mass after merge is less than 0.5. Under mild regularity conditions, the aforementioned fundamental properties of our proposed COSCI algorithm are established in Section 2.1 and its associated Appendix 6.1. As far as the implementation is concerned, a significant gain in computational time is achievable via a parallel implementation of the top for loop inAlgorithm1thatrunsacrossthepfeatures. 2.1 Theoretical Support: Perfect Screening Property Let I and I be the index sets corresponding to the “signal” and the “noise” features, S N respectively. Define p = |I | and p = |I |. Given feature j, we write S (τ) for S S N N j the corresponding largest merge size, computed the same way as S in Algorithm 1, but j under the restriction that the midpoint between the two merged sub-clusters lies between the 100τ-th and 100(1−τ)-th sample quantiles. Here τ is an arbitrarily small but positive number. 9 Theorem 1, stated below, establishes a uniform non-asymptotic bound on the merge sizes that are produced when our procedure is applied to the noise features. The regularity conditions, C1 and C2, are imposed on the family of marginal distributions of the noise features, and are fairly mild. In particular, they are satisfied for location-scale families of unimodal differentiable densities with a finite first moment, as long as the corresponding locationandscaleparametersareuniformlybounded. TheproofofTheorem1isprovided inAppendix6.1. Theorem 1. Suppose that regularity conditions C1 and C2, stated in Appendix 6.1, are satisfied. For each τ > 0 there exist positive constants c , c , b and κ, whose choice does 1 2 notdependoneithernorp ,suchthat,aslongasp ≤ exp(κn),inequalities N N log(p ∨n) N maxS (τ) ≤ b j j∈IN n holdwith(high)probabilityboundedbelowby1−c p−c2. 1 N The above theorem provides the theoretical justification for the screening step in our proposed procedure. The result is non-asymptotic, and the proof involves careful analysis of the empirical process associated with the merging algorithm for each feature. Under a mild restriction on the number of features relative to the sample size, the above theo- rem ensures that the clustering scores of all the noise features are uniformly very close to zero. Thus, if we use any arbitrarily small but prefixed value for the threshold α , we have 0 theoretical guarantees for perfectly screening out all the noise coordinates. It is important to have a small value of α to avoid screening out the informative features, which have 0 non-negligible clustering scores S . Theorem 1 suggests that αOR = blog(p ∨ n)/n is a j 0 reasonable choice. Provided n is sufficiently large, an approach using the above choice of α will not screen out the features identified as multi-cluster features by the popula- 0 tion clustering procedure, defined in Section 2.2 of Radchenko & Mukherjee (2014). The next result, which is a consequence of Theorem 1 above and Theorem 1 in Radchenko & Mukherjee (2014), formalizes this point. Note that Radchenko & Mukherjee (2014) demonstrate, through simulations and theoretical analysis, that the population procedure generally classifies multi-modal distributions as multi-cluster, provided the corresponding sub-populationsareofreasonablesize. Corollary 1. Suppose that regularity conditions C1 and C2, stated in Appendix 6.1, are 10

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.