Statistical Archetypal Analysis Chenyue Wua,∗, Esteban G. Tabaka aDepartment of Mathematics, Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, NY 10012, United States Abstract Statistical Archetypal Analysis (SAA) is introduced for the dimensional reduc- tionofacollectionofprobabilitydistributionsknownviasamples. Applications includemedicaldiagnosisfromclinicaldataintheformofdistributions(suchas distributionsofbloodpressureorheartratesfromdifferentpatients),theanaly- sisofclimatedatasuchastemperatureorwindspeedatdifferentlocations,and the study of bifurcations in stochastic dynamical systems. Distributions can be embedded into a Hilbert space with a suitable metric, and then analyzed simi- larly to feature vectors in Euclidean space. However, most dimensional reduc- tion techniques –such as Principal Component Analysis– are not interpretable fordistributions,asneitherthecomponentsnorthereconstructionofinputdata by components are themselves distributions. To obtain an interpretable result, Archetypal Analysis (AA) is extended to distributions, requiring the compo- nents to be mixtures of the input distributions and approximating the input distributions by mixtures of components. Keywords: archetypal analysis, dimension reduction, energy distance, kernel embedding, principal component analysis 1. Introduction 1 Finitecollectionsofprobabilitydistributionsappearnaturallyinavarietyof 2 settings, often as conditional distributions ρ(x|z) where z adopts a discrete set 3 of values. For instance, x may represent a collection of clinical variables such 4 as body temperature, blood pressure and cholesterol level, and z may stand for 5 covariatessuchassex, agegroupormedicaltreatment. Inanexamplethatthis 6 paper analyzes in some detail, x is the atmospheric temperature measured at 7 groundlevelandzstandsforthestationwherethemeasurementsareperformed. 8 It is therefore a natural extension of data analysis to use as either labels 9 or features, probability distributions instead of the more conventional discrete- 10 valued variables, continuum scalars or vectors. Thus one might want to predict 11 ∗Correspondingauthor Email addresses: [email protected](ChenyueWu),[email protected](Esteban G.Tabak) Preprint submitted to Elsevier February 2, 2017 not the temperature at a particular location and time but its probability distri- 12 bution,orclusterpopulationsformedicalpurposesaccordingtotheprobability 13 distributions of a group of clinical variables. 14 A basic quantity that permeates data analysis is the distance between data 15 points. There are several statistical distances in the literature that measure 16 thedissimilaritybetweentwoprobabilitydistributions. Somearebasedonana- 17 logues of the Euclidean distance, some on information theory, some on optimal 18 transport. Typically, each sheds a different light on what makes two distri- 19 butions different. In this article, we use the energy distance as a measure of 20 dissimilarity among distributions, as it is easy to evaluate efficiently from sam- 21 ple points and can be derived from an inner product, thus rendering accessible 22 many data analysis tools. 23 We study the problem of dimensional reduction of sets of distributions. Af- 24 ter being equipped with a metric and embedded into a Hilbert space, distri- 25 butions can be analyzed similarly to conventional feature vectors. However, 26 there is a gap between the dimensional reduction of distributions and vectors: 27 interpretability. Traditional dimension reduction techniques, such as principal 28 components analysis, lack interpretability when applied to probability distribu- 29 tions, as the projection of each distribution onto the low dimensional subspace 30 found is almost surely not a probability distribution: even though probability 31 distributions can be embedded into a Hilbert space, almost all elements in this 32 spacearenotprobabilitydistributions,sincetheseareconstrainedbypositivity 33 and normalization. 34 To overcome this difficulty in interpretation, we use the tools of archetypal 35 analysis. Archetypal analysis finds a small number of “archetypes” that are 36 convex combinations of the original data points, and approximates the origi- 37 nal data points again via convex combinations of these archetypes. A convex 38 combination can be interpreted as a mixture of probability distributions, so 39 the archetypes found by archetypal analysis are mixtures of the original dis- 40 tributions and the original distributions are approximated within the family of 41 mixtures of the archetypes. 42 Thispaperisarrangedasfollows: Section2givesareviewofarchetypalanal- 43 ysis,ofthealgorithmsforarchetypalanalysisinthegeneralcaseandspecifically 44 for energy distance. Section 3 reviews reproducing kernel Hilbert space, energy 45 distance, describes how distributions equipped with the energy distance can be 46 embeddedintoaHilbertspace, anddescribesalgorithmstoevaluatetheenergy 47 distance from samples. Section 4 introduces statistical archetypal analysis for 48 thedimensionalreductionofprobabilitydistributionsandincludesapplications 49 with numerical experiment. 50 2. Archetypal Analysis 51 Archetypalanalysisapproximatesdatapointsbyconvexcombinationofpro- 52 totypes, where these prototypes, denoted “archetypes”, are themselves convex 53 combinations of the data points. 54 2 Archetypal analysis was introduced in [1] –see also [2]– as a dimensional 55 reduction method alternative to principal components analysis (PCA), yielding 56 more interpretable results. It originated in the study of a dataset consisting of 57 6headdimensionsfor200soldiers,withthegoalofdesigningfacemasksforthe 58 Swiss Army. For this dataset, PCA found principal components that did not 59 resemble a head shape. To have patterns resembling “pure types” in the data, 60 each entry in the dataset was approximated by a mixture of the patterns. To 61 make patterns resemble the data, each pattern itself was a mixture of the data 62 points. 63 For a data matrix X =(x ,x ,··· ,x ) representing n observations, each of 1 2 n dimension m, Archetypal Analysis seeks k (cid:28)n m-dimensional archetypes Z = (z ,z ,··· ,z ),suchthateachx canbeapproximatedbyaconvexcombination 1 2 k i of the z : k (cid:88) x ≈a z +a z +···+a z , a ≥0, a =1, i 1i 1 2i 2 ki k ji ji j where the z themselves are convex combinations of the data: j (cid:88) z =b x +b x +···+b x , b ≥0, b =1. j 1j 1 2j 2 nj n ij ij i After setting a number of archetypes k, the coefficients a and b arise from the optimization problem (cid:13) (cid:13)2 n (cid:13) k n (cid:13) (cid:88)(cid:13) (cid:88) (cid:88) (cid:13) min (cid:13)xi− aji bljxl(cid:13) , (1) aji,blj i=1(cid:13)(cid:13) j=1 l=1 (cid:13)(cid:13) with constraints (cid:88) (cid:88) a ≥0, a =1, b ≥0, b =1, ji ji ij ij j i or, in terms of the matrices A=(a ) and B =(b ) , ji k×n lj n×k min(cid:107)X−XBA(cid:107)2 (2) F A,B under the same constraints, with (cid:107)·(cid:107) denoting the Frobenius norm 64 F (cid:32) (cid:33)1 p q 2 (cid:107)M(cid:107) = (cid:80) (cid:80) |m |2 . Alternatively, this can be written as: F ij i=1j=1 (cid:124) A,B =argmin tr[(I −BA) G(I −BA)], (3) n n where G = X(cid:124)X is the Gram matrix of data. This restatement is particularly 65 convenient,asitwillallowustoformulatetheproblemintermsofinnerproducts 66 among the data points instead of the points themselves, which in our problem 67 are distributions. Thus we need a norm for distributions that derive from an 68 inner product, for which we will adopt the energy distance. 69 3 3. Energy Distance 70 The energy distance is a metric defined on probability measures ([3, 4]), 71 which we will use to measure dissimilarity among probability distributions. 72 Definition 1 (Energy Distance). For probability measures µ,ν on Rd, random vectorsX,X(cid:48) ∼µ(x),Y,Y(cid:48) ∼ν(y),E(cid:107)X(cid:107)<∞,E(cid:107)Y(cid:107)<∞,theenergydistance between µ and ν, D(µ,ν), is defined by D2(µ,ν)=2E(cid:107)X−Y(cid:107)−E(cid:107)X−X(cid:48)(cid:107)−E(cid:107)Y −Y(cid:48)(cid:107), (4) where (cid:107)·(cid:107) is the Euclidean norm on Rd, and X, X(cid:48), Y and Y(cid:48) are pairwise 73 independent. 74 The energy distance as defined above is a metric on distributions ([5], [6]). It can be viewed as the metric induced by kernel embedding ([7]) with kernel k(x,y)=(cid:107)x−x (cid:107)+(cid:107)y−y (cid:107)−(cid:107)x−y(cid:107), (5) 0 0 wherex isafixedvalueinRd,whosechoicedoesnotaffecttheinducedmetric. 0 The kernel induces an inner product between distributions P and Q: (cid:104)P,Q(cid:105)=E k(X,Y) (6) X,Y where X ∼P, Y ∼Q, with X and Y independent. The corresponding square- distance is given by γ2(P,Q)=(cid:104)P,P(cid:105)+(cid:104)Q,Q(cid:105)−2(cid:104)P,Q(cid:105) k (7) =E k(X,X(cid:48))+E k(Y,Y(cid:48))−2E k(X,Y), XX(cid:48) YY(cid:48) XY wheretherandomvectorsX,X(cid:48) ∼P(x),Y,Y(cid:48) ∼Q(y)arepairwiseindependent (conditions for kernels to yield a metric can be found in [5, 8]). In terms of the kernel in (5), γ2(µ,ν)=2E(cid:107)X−x (cid:107)−E(cid:107)X−X(cid:48)(cid:107)+2E(cid:107)Y −x (cid:107)−E(cid:107)Y −Y(cid:48)(cid:107) k 0 0 −2E(cid:107)X−x (cid:107)−2E(cid:107)Y −x (cid:107)+2E(cid:107)X−Y(cid:107)=D2(µ,ν). 0 0 Anumberofdistancesfordistributionsisavailableintheliteratureofstatis- 75 tics,probabilityandinformationtheory,suchastheKullback-Leiblerdivergence 76 ([9, 10]) and the p-Wasserstein metric between two probability measures µ(x) 77 and ν(x) on a metric space (M,d) ([11]). We chose the energy distance be- 78 causeitcanbeestimatedefficientlyfromsamplesanditembedstheprobability 79 measures into a Hilbert space, which facilitates further analysis. 80 3.1. Estimating the Energy Distance from Data 81 In calculating the energy distance between two distributions µ and ν given independent random vectors X ∼µ, Y ∼ν and their i.i.d. copies X(cid:48), Y(cid:48), (cid:112) D(µ,ν)= 2E(cid:107)X−Y(cid:107)−E(cid:107)X−X(cid:48)(cid:107)−E(cid:107)Y −Y(cid:48)(cid:107), 4 oneneedstoevaluatethreeexpectations: E(cid:107)X−Y(cid:107),E(cid:107)X−X(cid:48)(cid:107)andE(cid:107)Y −Y(cid:48)(cid:107). 82 If we only have samples of µ and ν, these expectation can be approximated by 83 their empirical means. 84 Specifically, when we have samples {x }nX of µ and {y }nY of ν, we can i i=1 j j=1 estimate the energy distance between µ and ν by the energy distance between their corresponding empirical distributions µˆ and νˆ: (cid:113) D(µˆ,νˆ)= 2E(cid:107)Xˆ −Yˆ(cid:107)−E(cid:107)Xˆ −Xˆ(cid:48)(cid:107)−E(cid:107)Yˆ −Yˆ(cid:48)(cid:107). (8) In the equations above, E(cid:107)Xˆ −Yˆ(cid:107)= 1 i=nX(cid:88),j=nY (cid:107)x −y (cid:107) (9) n n i j X Y i,j=1 is the empirical mean of E(cid:107)X−Y(cid:107). For Xˆ(cid:48), we use the same samples available for X, E(cid:107)Xˆ −Xˆ(cid:48)(cid:107)= 1 i=nX(cid:88),i(cid:48)=nX(cid:107)x −x (cid:107). (10) n n i i(cid:48) X X i,i(cid:48)=1 Similarly, E(cid:107)Yˆ −Yˆ(cid:48)(cid:107)= 1 j=n(cid:88)Y,j=nY (cid:107)y −y (cid:107). (11) n n j j(cid:48) Y Y j,j(cid:48)=1 According to the formulations above for estimating energy distance from 85 samples, if we have n sample points for µ and n sample points for ν, the 86 X Y time complexity of estimating their energy distance is O(n n +n2 +n2). 87 X Y X Y Thecorrespondinginnerproductbetweendistributionsµandν, giveninde- pendent random vectors X ∼µ, Y ∼ν and X(cid:48), Y(cid:48), is (cid:104)µ,ν(cid:105)=E(cid:107)X−x (cid:107)+E(cid:107)Y −x (cid:107)−E(cid:107)X−Y(cid:107), (12) 0 0 where x is a fixed point. Similarly, calculation of this inner product involves 0 threeexpectations: E(cid:107)X−x (cid:107),E(cid:107)Y−x (cid:107),E(cid:107)X−Y(cid:107). Whenµandν areknown 0 0 via their samples {x }nX and {y }nY , their inner product (µ,ν) is estimated i i=1 j j=1 by (cid:104)µˆ,νˆ(cid:105)=E(cid:107)Xˆ −x (cid:107)+E(cid:107)Yˆ −x (cid:107)−E(cid:107)Xˆ −Yˆ(cid:107), (13) 0 0 where E(cid:107)Xˆ −x (cid:107)= 1 (cid:88)nX (cid:107)x −x (cid:107), (14) 0 n i 0 X i=1 E(cid:107)Yˆ −x (cid:107)= 1 (cid:88)nY (cid:107)y −x (cid:107), (15) 0 n j 0 Y j=1 E(cid:107)Xˆ −Yˆ(cid:107)= 1 i=nX(cid:88),j=nY (cid:107)x −y (cid:107), (16) n n i j X Y i,j=1 5 Algorithm 1 Generic algorithm for estimating the energy distance Input: Samples x ,y of X,Y respectively. i j Output: Empirical estimation of E(cid:107)X−Y(cid:107). 1: procedure Energy({xi},{yj}) 2: sum=0 3: for all xi do 4: for all yj do 5: sum=sum+(cid:107)xi−yj(cid:107) 6: end for 7: end for 8: return sum nXnY 9: end procedure and the time complexity of estimating this inner product is O(n n ). If we 88 X Y have n sample points for both µ and ν, the time complexity is O(n2). 89 Noticethatestimatingtheenergydistanceandthecorrespondinginnerprod- 90 uct from n sample points of both distributions have time complexities O(n2), 91 whichbecomescomputationallyexpensivewhenusingalargenumberofsample 92 points. In the following section, a fast algorithm for energy distance between 93 one-dimensional distributions is introduced, making the application of energy 94 distance much more efficient. 95 3.2. Fast Algorithm in One Dimension 96 According to (9), (10), (11), (14), (15) and (16), both the data-based com- putations of energy distance (8) and corresponding inner product (13) have the same complexity of evaluating E(cid:107)Xˆ −Yˆ(cid:107)= 1 (cid:88)nX (cid:88)nY (cid:107)x −y (cid:107), (17) n n i j X Y i=1j=1 where the (x , y ) are (n , n ) samples of (X, Y). Generally, the time com- 97 i j X Y plexity of evaluating (17) is O(n2) via Algorithm 1, which simply takes the 98 arithmetic mean of (cid:107)x −y (cid:107). 99 i j 100 In one-dimensional space, however, thefact that (cid:107)·(cid:107)=|·| enables us to use 6 the identity |x−y|=1 (x−y)−1 (x−y) to obtain x−y>0 x−y≤0 E(cid:107)Xˆ −Yˆ(cid:107) 1 (cid:88)nX (cid:88)nY = |x −y | n n i j X Y i=1j=1 1 (cid:88)nX (cid:88)nY = 1 (x −y )−1 (x −y ) n n {xi−yj>0} i j {xi−yj≤0} i j X Y i=1j=1 = 1 (cid:88)nX #{j|yj <xi}−#{j|yj ≥xi}x + 1 (cid:88)nY #{i|xi ≤yj}−#{i|xi >yj}y , n n i n n j X Y Y X i=1 j=1 where #{···} denote the number of elements in a set. 101 If {x }nX and {y }nY are sorted arrays, the latter expression can be calcu- 102 i i=1 j j=1 lated in the linear time O(n +n ), since each of #{j|y <x }, #{j|y ≥x }, 103 X Y j i j i #{i|x ≤y } and #{i|x >y } can be calculated in linear time by merging 104 i j i j {x }nX and {y }nY into one sorted array (Algorithm 2.) 105 i i=1 j j=1 If given unsorted samples, we need to sort them before applying Algorithm 106 2. Feasible sorting algorithms are quick sort, which has an O(nlogn) average 107 complexityandanO(n2)worstcasecomplexity,heapsortandmergesort,which 108 have an O(nlogn) worst case complexity. Therefore even for unsorted samples, 109 thecomplexityofestimatingtheenergydistancecanbeboundedbyO(nlogn). 110 4. Statistical Archetypal Analysis 111 4.1. Dimensional Reduction 112 In this section, we study the dimensional reduction of probability distri- 113 butions, mapping a collection of distributions to a low-dimensional space with 114 minimal loss of information. Probability distributions have infinite dimension; 115 when they are known via samples, they can be said to have a dimensionality of 116 the order of the number of samples points. Our dimensional reduction on this 117 high-dimensional dataset consists of two steps: we embed the distributions into 118 an Euclidean space, and then use dimensional reduction methods developed for 119 Euclidean spaces. 120 Probability distributions equipped with the energy distance form a convex 121 subset of a Hilbert space. Therefore a collection of N distributions µ can be 122 i naturally embedded into an N-dimensional Euclidean space, since every finite 123 dimension subspace of a Hilbert space is isometric to an Euclidean space. 124 Assume that x ∈ RN, i ∈ [1,2,··· ,N], are points in RN such that (cid:107)x − i i x (cid:107)=D(µ ,µ )whereD(·)istheenergydistance(Inotherwords, x istheim- j i j i ageofµ undertheembeddingintoanEuclideanspace.) PrincipalComponents i Analysis (PCA) solves the following optimization problem for centered x : i (cid:13) (cid:13)2 N (cid:13) K (cid:13) (cid:88)(cid:13) (cid:88) (cid:13) min (cid:13)xi− ajizj(cid:13) (18) zj,ajii=1(cid:13)(cid:13) j=1 (cid:13)(cid:13) 7 Algorithm 2 Fast algorithm for estimating the energy distance in 1D Input: Sorted samples x ,y of 1D random variable X,Y respectively i j Output: Empirical estimation of E(cid:107)X−Y(cid:107) 1: procedure FastEnergy({xi},{yj}) 2: sumX =0,sumY =0,i=1,j =1 3: while i≤nX and j ≤nY do 4: if xi ≤yj then 5: sumX =sumX + (j−1)−[nnYY−(j−1)]xi 6: i=i+1 7: else 8: sumY =sumY + (i−1)−[nnXX−(i−1)]yj 9: j =j+1 10: end if 11: end while 12: if i>nX then n(cid:80)Y 13: sumY =sumY + yk k=j 14: else n(cid:80)X 15: sumX =sumX + xk k=i 16: end if 17: return sumX/nX +sumY/nY 18: end procedure under the constraints that the z are orthonormal vectors. Thus PCA maps 125 j eachdatapointtotheclosestpointinthevectorspacespannedbythez . Here 126 j K is the dimension of the low dimensional space sought, and (cid:80)K a z is the 127 j=1 ji j image of x under this dimensional reduction. 128 i PCA and other mainstream dimensional reduction techniques are not ap- 129 propriate for probability distributions from two perspectives: 1) the z in (18) 130 j is generally not a probability distribution, neither are almost all points in the 131 spacespannedbythez . 2)thecoefficientsa foreachx in(18)maybenega- 132 j ji i tive,sotheycannotclearlyexpresshoweachz contributestotherepresentation 133 j of x . We will use instead archetypal analysis for the dimensional reduction of 134 i distributions, which does not suffer from this lack of interpretability. 135 4.2. Statistical Archetypal Analysis 136 As seen in Section 2, archetypal analysis has a formulation similar to (18), except that it requires the optimization of (cid:13) (cid:13)2 N (cid:13) K (cid:13) (cid:88)(cid:13) (cid:88) (cid:13) min (cid:13)xi− ajizj(cid:13) (19) zj,ajii=1(cid:13)(cid:13) j=1 (cid:13)(cid:13) 8 K under the constraints that a (cid:62) 0, (cid:80) a = 1, and each z is a convex combi- 137 ji ji j j=1 N N nation of the x , i.e. z = (cid:80)b x , with b (cid:62)0, (cid:80)b =1. 138 i j lj l lj lj l=1 l=1 Switchingfromvectorsx ,z todistributionsµ ,ν andusingenergydistance i j i j instead of Euclidean distance, statistical archetypal analysis adopts the form (cid:13) (cid:13)2 N (cid:13) K (cid:13) N (cid:88)(cid:13) (cid:88) (cid:13) (cid:88) min (cid:13)µi− ajiνj(cid:13) , νj = bljµl, (20) νj,ajii=1(cid:13)(cid:13) j=1 (cid:13)(cid:13) l=1 with the same constraints over the a and b, which now adopt the natural inter- 139 pretationthattheν aremixturesoftheµ andthelatterarewell-approximated 140 j i by mixtures of the ν . (cid:107)·(cid:107) in (20) is the energy distance, but can naturally be 141 j extended to any metric induced by kernel embedding as discussed in Section 3. 142 Since the energy distance, that we shall use for the norm in (20), derives fromaninnerproduct,statisticalarchetypalanalysiscanberewrittenasin(3): (cid:124) argmin tr[(I −BA) G(I −BA)], (21) n n A,B where each column of A represents one archetype as a convex combination of theoriginaldistributions,andeachcolumnofB containsthecoefficientsforthe approximate reconstruction of each original distribution from the archetypes. G is the Gram matrix of pairwise inner products among the distributions, G =Ek(X ,X ) ij i j for independent X ∼µ and X ∼µ and kernel k. 143 i i j j 144 When each µi is known via samples {ym(i)}Mm=i1 of size Mi, we can replace µi byitsempiricaldistributionatdatapointsy(i) withweights 1 . Inthissetting, 145 m Mi ν becomes an empirical distribution concentrated at the union of the y(l) over 146 j m l=1,2,··· ,N, with weights blj for all m. The resulting number of samples of 147 Ml ν appears large, since it contains the support of every empirical distribution 148 j µ . However, since the solution of (19) is sparse, most entries in b are zero, so 149 i lj we only need to keep those data points y(l) for ν where b is non-zero. 150 m j lj Statistical archetypal analysis overcomes the two difficulties in interpreta- 151 tionwhenapplyingdimensionreductiononprobabilitydistribution. Archetypes 152 {ν }k found by archetypal analysis, which are mixtures of the {µ }n , are all 153 i i=1 i i=1 probability distributions. The low-dimensional space used to capture informa- 154 tionofthedatasetofdistributionsinthiscaseistheconvexhullofallarchetypes, 155 i.e. the family of mixtures of all archetypes. Each coefficient a in (19) stands 156 ji for the contribution of the jth archetype ν to µ . 157 j i 4.3. Numerical Examples 158 4.3.1. Synthetic Data 159 Inourfirstexample, wesimulate100probabilitydistributions{µ }100, each 160 i i=1 a Gaussian mixture µ =λ N(−6,2)+(1−λ )N(6,1), where each λ is drawn 161 i i i i 9 independently from the uniform distribution in [0,1]. 162 Figure1: Archetypesofsyntheticdatafork=2. Thecurvesarethefoundarchetypesandthe shadowsarethetwocomponentsN(−6,2)andN(6,1)inthemixturefamilyrespectively. We set number of archetypes k to 2 and perform archetypal analysis on the 163 syntheticdata. ThetwoarchetypesfoundareshownandcomparedtoN(−6,2) 164 and N(6,1), the two components in the mixture family, in Figure 1. Both of 165 themareclosetothecomponentsexceptatthecenterandtailpart. Thisisdue 166 tothedefinitionofarchetypes,whichisamixtureofinputdistributions. Unless 167 wehaveexactlythesetwocomponentsasinput,thearchetypeswillalwayshave 168 a heavier tail. 169 4.3.2. Temperature Data 170 We work with ground temperature data from United States Climate Refer- 171 ence Network (USCRN) Quality Controlled Datasets ([dataset][12], [13], [14]). 172 Thetemperaturedataaremeasuredhourlyin43citiesacrosstheUnitedStates. 173 Weoperateondatafromwhichthediurnalandseasonalsignalhasbeenremoved 174 usingtheoptimal-transportbasedmethodologyin[15]. Inaddition,thisdataset 175 has missing values, which are filled using a low rank approximation to the data 176 matrix. Figure 2 shows the 43 cities on the map with Table 1 a complete list of 177 the cities. 178 10
Description: