A Tutorial on Principal Component Analysis

Jonathon Shlens∗
Systems Neurobiology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037
and Institute for Nonlinear Science, University of California, San Diego, La Jolla, CA 92093-0402
(Dated: December 10, 2005; Version 2)

∗ Electronic address: [email protected]

Principal component analysis (PCA) is a mainstay of modern data analysis - a black box that is widely used but poorly understood. The goal of this paper is to dispel the magic behind this black box. This tutorial focuses on building a solid intuition for how and why principal component analysis works; furthermore, it crystallizes this knowledge by deriving, from simple intuitions, the mathematics behind PCA. This tutorial does not shy away from explaining the ideas informally, nor does it shy away from the mathematics. The hope is that by addressing both aspects, readers of all levels will be able to gain a better understanding of PCA as well as the when, the how and the why of applying this technique.

I. INTRODUCTION

Principal component analysis (PCA) has been called one of the most valuable results from applied linear algebra. PCA is used abundantly in all forms of analysis - from neuroscience to computer graphics - because it is a simple, non-parametric method of extracting relevant information from confusing data sets. With minimal additional effort PCA provides a roadmap for how to reduce a complex data set to a lower dimension to reveal the sometimes hidden, simplified structure that often underlies it.

The goal of this tutorial is to provide both an intuitive feel for PCA and a thorough discussion of this topic. We will begin with a simple example and provide an intuitive explanation of the goal of PCA. We will continue by adding mathematical rigor to place it within the framework of linear algebra and to provide an explicit solution. We will see how and why PCA is intimately related to the mathematical technique of singular value decomposition (SVD). This understanding will lead us to a prescription for how to apply PCA in the real world. We will discuss both the assumptions behind this technique as well as possible extensions to overcome its limitations.

The discussion and explanations in this paper are informal in the spirit of a tutorial. The goal of this paper is to educate.

II. MOTIVATION: A TOY EXAMPLE

Here is the perspective: we are an experimenter. We are trying to understand some phenomenon by measuring various quantities (e.g. spectra, voltages, velocities, etc.) in our system. Unfortunately, we cannot figure out what is happening because the data appears clouded, unclear and even redundant. This is not a trivial problem, but rather a fundamental obstacle in empirical science. Examples abound from complex systems such as neuroscience, photometry, meteorology and oceanography - the number of variables to measure can be unwieldy and at times even deceptive, because the underlying relationships can often be quite simple.

Take for example a simple toy problem from physics diagrammed in Figure 1. Pretend we are studying the motion of the physicist's ideal spring. This system consists of a ball of mass m attached to a massless, frictionless spring. The ball is released a small distance away from equilibrium (i.e. the spring is stretched). Because the spring is "ideal," it oscillates indefinitely along the x-axis about its equilibrium at a set frequency.

This is a standard problem in physics in which the motion along the x direction is solved by an explicit function of time. In other words, the underlying dynamics can be expressed as a function of a single variable x.
Occasionally, rigorous mathematical proofs are necessary, although they are relegated to the Appendix. Although not as vital to the tutorial, the proofs are presented for the adventurous reader who desires a more complete understanding of the math. The only assumption is that the reader has a working knowledge of linear algebra. Please feel free to contact me with any suggestions, corrections or comments.

FIG. 1  A diagram of the toy example.

However, being ignorant experimenters we do not know any of this. We do not know which, let alone how many, axes and dimensions are important to measure. Thus, we decide to measure the ball's position in a three-dimensional space (since we live in a three-dimensional world). Specifically, we place three movie cameras around our system of interest. At 120 Hz each movie camera records an image indicating a two-dimensional position of the ball (a projection). Unfortunately, because of our ignorance, we do not even know what the real "x", "y" and "z" axes are, so we choose three camera axes {a, b, c} at some arbitrary angles with respect to the system. The angles between our measurements might not even be 90°! Now, we record with the cameras for several minutes. The big question remains: how do we get from this data set to a simple equation of x?

We know a priori that if we were smart experimenters, we would have just measured the position along the x-axis with one camera. But this is not what happens in the real world. We often do not know which measurements best reflect the dynamics of our system in question. Furthermore, we sometimes record more dimensions than we actually need!

Also, we have to deal with that pesky, real-world problem of noise. In the toy example this means that we need to deal with air, imperfect cameras or even friction in a less-than-ideal spring. Noise contaminates our data set, only serving to obfuscate the dynamics further. This toy example is the challenge experimenters face every day. We will refer to this example as we delve further into abstract concepts. Hopefully, by the end of this paper we will have a good understanding of how to systematically extract x using principal component analysis.

III. FRAMEWORK: CHANGE OF BASIS

The goal of principal component analysis is to compute the most meaningful basis to re-express a noisy data set. The hope is that this new basis will filter out the noise and reveal hidden structure. In the example of the spring, the explicit goal of PCA is to determine: "the dynamics are along the x-axis." In other words, the goal of PCA is to determine that x̂ - the unit basis vector along the x-axis - is the important dimension. Determining this fact allows an experimenter to discern which dynamics are important, which are just redundant and which are just noise.

A. A Naive Basis

With a more precise definition of our goal, we need a more precise definition of our data as well. We treat every time sample (or experimental trial) as an individual sample in our data set. At each time sample we record a set of data consisting of multiple measurements (e.g. voltage, position, etc.). In our data set, at one point in time, camera A records a corresponding ball position (x_A, y_A). One sample or trial can then be expressed as a 6-dimensional column vector

    X~ = [ x_A ]
         [ y_A ]
         [ x_B ]
         [ y_B ]
         [ x_C ]
         [ y_C ]

where each camera contributes a 2-dimensional projection of the ball's position to the entire vector X~. If we record the ball's position for 10 minutes at 120 Hz, then we have recorded 10 × 60 × 120 = 72000 of these vectors.

With this concrete example, let us recast this problem in abstract terms. Each sample X~ is an m-dimensional vector, where m is the number of measurement types. Equivalently, every sample is a vector that lies in an m-dimensional vector space spanned by some orthonormal basis. From linear algebra we know that all measurement vectors form a linear combination of this set of unit-length basis vectors. What is this orthonormal basis?

This question is usually a tacit assumption often overlooked. Pretend we gathered our toy example data above, but only looked at camera A. What is an orthonormal basis for (x_A, y_A)? A naive choice would be {(1,0),(0,1)}, but why select this basis over {(√2/2, √2/2), (−√2/2, √2/2)} or any other arbitrary rotation? The reason is that the naive basis reflects the method we gathered the data. Pretend we record the position (2,2). We did not record 2√2 in the (√2/2, √2/2) direction and 0 in the perpendicular direction. Rather, we recorded the position (2,2) on our camera, meaning 2 units up and 2 units to the left in our camera window. Thus our naive basis reflects the method we measured our data.

How do we express this naive basis in linear algebra? In the two-dimensional case, {(1,0),(0,1)} can be recast as individual row vectors. A matrix constructed out of these row vectors is the 2 × 2 identity matrix I. We can generalize this to the m-dimensional case by constructing an m × m identity matrix

    B = [ b_1 ]   [ 1 0 ⋯ 0 ]
        [ b_2 ] = [ 0 1 ⋯ 0 ] = I
        [  ⋮  ]   [ ⋮ ⋮ ⋱ ⋮ ]
        [ b_m ]   [ 0 0 ⋯ 1 ]

where each row is an orthonormal basis vector b_i with m components. We can consider our naive basis as the effective starting point. All of our data has been recorded in this basis and thus it can be trivially expressed as a linear combination of {b_i}.
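It may help to see this bookkeeping in code. The Matlab sketch below is an assumed illustration (it is not the demonstration code of Appendix B); the variable names camA, camB and camC are hypothetical stand-ins for the three cameras' recordings. It simply stacks the recordings into the m × n data matrix whose columns are the samples X~.

    n = 72000;                      % 10 minutes at 120 Hz
    camA = rand(2, n);              % placeholder recordings standing in for real data
    camB = rand(2, n);
    camC = rand(2, n);
    X = [camA; camB; camC];         % m x n data matrix, m = 6 measurement types
    size(X)                         % returns [6 72000]; each column is one sample X~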
B. Change of Basis

With this rigor we may now state more precisely what PCA asks: Is there another basis, which is a linear combination of the original basis, that best re-expresses our data set?

A close reader might have noticed the conspicuous addition of the word linear. Indeed, PCA makes one stringent but powerful assumption: linearity. Linearity vastly simplifies the problem by (1) restricting the set of potential bases, and (2) formalizing the implicit assumption of continuity in a data set.¹

With this assumption PCA is now limited to re-expressing the data as a linear combination of its basis vectors. Let X be the original data set, where each column is a single sample (or moment in time) of our data set (i.e. X~). In the toy example X is an m × n matrix where m = 6 and n = 72000. Let Y be another m × n matrix related by a linear transformation P. X is the original recorded data set and Y is a re-representation of that data set.

    PX = Y    (1)

Also let us define the following quantities.²

• p_i are the rows of P.
• x_i are the columns of X (or individual X~).
• y_i are the columns of Y.

Equation 1 represents a change of basis and thus can have many interpretations.

1. P is a matrix that transforms X into Y.
2. Geometrically, P is a rotation and a stretch which again transforms X into Y.
3. The rows of P, {p_1, …, p_m}, are a set of new basis vectors for expressing the columns of X.

The latter interpretation is not obvious but can be seen by writing out the explicit dot products of PX.

    PX = [ p_1 ]
         [  ⋮  ] [ x_1 ⋯ x_n ]
         [ p_m ]

    Y  = [ p_1·x_1  ⋯  p_1·x_n ]
         [    ⋮     ⋱     ⋮    ]
         [ p_m·x_1  ⋯  p_m·x_n ]

We can note the form of each column of Y,

    y_i = [ p_1·x_i ]
          [    ⋮    ]
          [ p_m·x_i ]

We recognize that each coefficient of y_i is a dot product of x_i with the corresponding row in P. In other words, the jth coefficient of y_i is a projection onto the jth row of P. This is in fact the very form of an equation where y_i is a projection onto the basis {p_1, …, p_m}. Therefore, the rows of P are indeed a new set of basis vectors for representing the columns of X.

¹ A subtle point it is, but we have already assumed linearity by implicitly stating that the data set even characterizes the dynamics of the system. In other words, we are already relying on the superposition principle of linearity to believe that the data provides an ability to interpolate between individual data points.
² In this section x_i and y_i are column vectors, but be forewarned: in all other sections x_i and y_i are row vectors.
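As a small numerical illustration of Equation 1 (assumed here for the tutorial, not part of the original text), the Matlab sketch below re-expresses camera A's recorded point (2,2) in the rotated basis {(√2/2, √2/2), (−√2/2, √2/2)} discussed above; the result is (2√2, 0), the same point described in the new basis.

    P = [ sqrt(2)/2  sqrt(2)/2 ;
         -sqrt(2)/2  sqrt(2)/2 ];   % rows are the rotated basis vectors; P*P' = I
    X = [2; 2];                     % the position recorded in the naive basis
    Y = P * X                       % returns [2*sqrt(2); 0], the point in the new basis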
C. Questions Remaining

By assuming linearity the problem reduces to finding the appropriate change of basis. The row vectors {p_1, …, p_m} in this transformation will become the principal components of X. Several questions now arise.

• What is the best way to "re-express" X?
• What is a good choice of basis P?

These questions must be answered by next asking ourselves what features we would like Y to exhibit. Evidently, additional assumptions beyond linearity are required to arrive at a reasonable result. The selection of these assumptions is the subject of the next section.

IV. VARIANCE AND THE GOAL

Now comes the most important question: what does "best express" the data mean? This section will build up an intuitive answer to this question and along the way tack on additional assumptions. The end of this section will conclude with a mathematical goal for deciphering "garbled" data.

In a linear system "garbled" can refer to only three potential confounds: noise, rotation and redundancy. Let us deal with each situation individually.

A. Noise and Rotation

Measurement noise in any data set must be low or else, no matter the analysis technique, no information about a system can be extracted. There exists no absolute scale for noise; rather, all noise is measured relative to the measurement. A common measure is the signal-to-noise ratio (SNR), or a ratio of variances σ²,

    SNR = σ²_signal / σ²_noise.    (2)

A high SNR (≫ 1) indicates high-precision data, while a low SNR indicates noise-contaminated data.

FIG. 2  (a) Simulated data of (x_A, y_A) for camera A. The signal and noise variances σ²_signal and σ²_noise are graphically represented by the two lines subtending the cloud of data. (b) Rotating these axes finds an optimal p* where the variance and SNR are maximized. The SNR is defined as the ratio of the variance along p* and the variance in the perpendicular direction.

Pretend we plotted all data from camera A from the spring in Figure 2a. Remembering that the spring travels in a straight line, every individual camera should record motion in a straight line as well. Therefore, any spread deviating from straight-line motion must be noise. The variances due to the signal and the noise are indicated by each line in the diagram. The ratio of the two lengths, the SNR, thus measures how "fat" the cloud is - a range of possibilities includes a thin line (SNR ≫ 1), a perfect circle (SNR = 1) or even worse. By positing reasonably good measurements, quantitatively we assume that directions with the largest variances in our measurement vector space contain the dynamics of interest. In Figure 2 the direction with the largest variance is not x̂_A = (1,0) nor ŷ_A = (0,1), but the direction along the long axis of the cloud. Thus, by assumption the dynamics of interest exist along directions with largest variance and presumably highest SNR.
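The rotation in Figure 2b can be mimicked numerically. The sketch below is an assumed illustration on simulated data (the straight-line signal, its 30° orientation and the noise level are all made up for the example): it sweeps a unit vector through a range of angles and measures the variance of the projected data along each direction; the angle with the largest variance plays the role of p*.

    n = 1000;
    signal = randn(1, n);                          % motion along the spring's axis
    noise  = 0.1 * randn(2, n);                    % small isotropic measurement noise
    data = [cosd(30); sind(30)] * signal + noise;  % a line at 30 degrees, plus noise
    data = data - repmat(mean(data, 2), 1, n);     % zero-mean each measurement

    angles = 0:179;                                % candidate directions, in degrees
    v = zeros(size(angles));
    for k = 1:length(angles)
        p = [cosd(angles(k)); sind(angles(k))];    % unit vector at this angle
        v(k) = var(p' * data);                     % variance of the data projected onto p
    end
    [~, kmax] = max(v);
    angles(kmax)                                   % close to 30: the direction of p*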
Our assumption suggests that the basis for which we are searching is not the naive basis (x̂_A, ŷ_A), because these directions do not correspond to the directions of largest variance. Maximizing the variance (and by assumption the SNR) corresponds to finding the appropriate rotation of the naive basis. This intuition corresponds to finding the direction p* in Figure 2b. How do we find p*? In the 2-dimensional case of Figure 2a, p* falls along the direction of the best-fit line for the data cloud. Thus, rotating the naive basis to lie parallel to p* would reveal the direction of motion of the spring for the 2-D case. How do we generalize this notion to an arbitrary number of dimensions? Before we approach this question we need to examine the issue from a second perspective.

B. Redundancy

Figure 2 hints at an additional confounding factor in our data - redundancy. This issue is particularly evident in the example of the spring. In this case multiple sensors record the same dynamic information. Reconsider Figure 2 and ask whether it was really necessary to record two variables. Figure 3 might reflect a range of possible plots between two arbitrary measurement types r_1 and r_2. Panel (a) depicts two recordings with no apparent relationship. In other words, r_1 is entirely uncorrelated with r_2. Because one can not predict r_1 from r_2, one says that r_1 and r_2 are statistically independent. This situation could occur by plotting two variables such as (x_A, humidity).

FIG. 3  A spectrum of possible redundancies in data from two separate recordings r_1 and r_2 (e.g. x_A, y_B). The best-fit line r_2 = k r_1 is indicated by the dashed line.

On the other extreme, Figure 3c depicts highly correlated recordings. This extremity might be achieved by several means:

• A plot of (x_A, x_B) if cameras A and B are very nearby.
• A plot of (x_A, x̃_A) where x_A is in meters and x̃_A is in inches.

Clearly in panel (c) it would be more meaningful to just have recorded a single variable, not both. Why? Because one can calculate r_1 from r_2 (or vice versa) using the best-fit line. Recording solely one response would express the data more concisely and reduce the number of sensor recordings (2 → 1 variables). Indeed, this is the very idea behind dimensional reduction.
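The degree of redundancy between two recordings can be gauged exactly as the figure suggests, by fitting the line r_2 = k r_1 and asking how well it explains the data. The Matlab sketch below is an assumed illustration on simulated recordings (the names r1, r2 and the slope 2.54 are invented for the example) of the highly redundant case in panel (c).

    n = 1000;
    r1 = randn(1, n);                      % one recording
    r2 = 2.54 * r1 + 0.05 * randn(1, n);   % the same quantity in other units, plus noise

    k = (r2 * r1') / (r1 * r1');           % least-squares slope of the best-fit line r2 = k*r1
    C = corrcoef(r1, r2);                  % C(1,2) near 1 indicates high redundancy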
A large (small) value in- The off-diagonal terms of C are the covariance dicates high (low) redundancy. In the case of Figures 2a • X between measurement types. and 3c, the covariances are quite large. Some additional facts about the covariance. C captures the correlations between all possible pairs X σ2 0 (non-negative). σ is zero if and only if ofmeasurements. Thecorrelationvaluesreflectthenoise • AB ≥ AB and redundancy in our measurements. A and B are entirely uncorrelated. • σA2B =σA2 if A=B. • Ivnaltuheesdicaogrorensaplotnedrmst,obyinatsesruemstpintgiond,ylanragmei(cssma(lolr) WecanequivalentlyconvertAandB intocorresponding noise). row vectors. In the off-diagonal terms large (small) values cor- • respond to high (low) redundancy. a = [a1 a2 ... an] b = [b1 b2 ... bn] Pretend we have the option of manipulating CX. We will suggestively define our manipulated covariance ma- sothatwemaynowexpressthecovarianceasadotprod- trix C . What features do we want to optimize in C ? Y Y uct matrix computation. 1 σ2 abT (3) D. Diagonalize the Covariance Matrix ab ≡ n 1 − where 1 is a constant for normalization.5 We can summarize the last two sections by stating n 1 that our goals are (1) to minimize redundancy, mea- Finall−y, we can generalize from two vectors to an arbi- sured by covariance, and (2) maximize the signal, mea- trary number. We can rename the row vectors x a, 1 ≡ sured by variance. By definition covariances must be x b and consider additional indexed row vectors 2 ≡ non-negative, thus the minimal covariance is zero. What x ,...,x . Now we can define a new m n matrix X3. m × wouldtheoptimizedcovariancematrixCYlooklike? Ev- idently, in an “optimized” matrix all off-diagonal terms x1 in CY are zero. Thus, CY must be diagonal. X= .. (4) There are many methods for diagonalizing CY. It is . curious to note that PCA arguably selects the easiest xm method, perhaps accounting for its widespread applica- tion. One interpretation of X is the following. Each row of X PCA assumes that all basis vectors p ,...,p are 1 m correspondstoallmeasurementsofaparticulartype(xi). orthonormal (i.e. p p =δ ). In the l{anguage of l}inear i j ij Each column of X corresponds to a set of measurements algebra, PCA assume·s P is an orthonormal matrix. Sec- fromoneparticulartrial(thisisX~ fromsection3.1). We ondly PCA assumes the directions with the largest vari- now arrive at a definition for the covariance matrix CX. ances the signals and the most “important.” In other words, they are the most principal. Why are these as- 1 C XXT. (5) sumptions easiest? X ≡ n 1 Envision how PCA works. In our simple example in − Figure2b,P actsasageneralizedrotationtoalignabasis with the maximally variant axis. In multiple dimensions 4 h·ii denotestheaverageovervaluesindexedbyi. this could be performed by a simple algorithm: 5 The most straight-forward normalization is 1. However, this n 1. Select a normalized direction in m-dimensional provides a biased estimation of variance particularly for small space along which the variance in X is maximized. n. Itisbeyondthescopeofthispapertoshowthattheproper normalizationforanunbiasedestimatoris n11. Save this vector as p1. − 6 2. Findanotherdirectionalongwhichvarianceismax- this assumption formally guarantees that the imized, however, because of the orthonormality SNR and the covariance matrix fully charac- condition, restrict the search to all directions per- terize the noise and redundancies. pindicular to all previous selected directions. Save this vector as p i 3. 
D. Diagonalize the Covariance Matrix

We can summarize the last two sections by stating that our goals are (1) to minimize redundancy, measured by covariance, and (2) to maximize the signal, measured by variance. By definition covariances must be non-negative, thus the minimal covariance is zero. What would the optimized covariance matrix C_Y look like? Evidently, in an "optimized" matrix all off-diagonal terms in C_Y are zero. Thus, C_Y must be diagonal.

There are many methods for diagonalizing C_Y. It is curious to note that PCA arguably selects the easiest method, perhaps accounting for its widespread application.

PCA assumes that all basis vectors {p_1, …, p_m} are orthonormal (i.e. p_i·p_j = δ_ij). In the language of linear algebra, PCA assumes P is an orthonormal matrix. Secondly, PCA assumes the directions with the largest variances are the signals and the most "important." In other words, they are the most principal. Why are these assumptions easiest?

Envision how PCA works. In our simple example in Figure 2b, P acts as a generalized rotation to align a basis with the maximally variant axis. In multiple dimensions this could be performed by a simple algorithm:

1. Select a normalized direction in m-dimensional space along which the variance in X is maximized. Save this vector as p_1.
2. Find another direction along which variance is maximized; however, because of the orthonormality condition, restrict the search to all directions perpendicular to all previously selected directions. Save this vector as p_i.
3. Repeat this procedure until m vectors are selected.

The resulting ordered set of p's are the principal components.
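The iterative procedure above can be sketched numerically. The Matlab fragment below is only an illustration of the idea, not one of the paper's solutions: it finds each maximum-variance direction by power iteration on the covariance of the deflated data (power iteration is an assumed ingredient here, not taken from the paper), and restricts later searches to perpendicular directions by removing each found direction from the data.

    X = randn(6, 1000);                     % placeholder data, m = 6 measurement types
    [m, n] = size(X);
    X = X - repmat(mean(X, 2), 1, n);       % mean deviation form
    R = X;                                  % residual data, deflated as we go
    P = zeros(m, m);                        % rows will hold p_1, ..., p_m
    for i = 1:m
        C = (R * R') / (n - 1);             % covariance of the residual
        p = randn(m, 1);                    % power iteration converges to the
        for t = 1:200                       % direction of maximum residual variance
            p = C * p;
            p = p / norm(p);
        end
        P(i, :) = p';                       % save this direction as p_i
        R = R - p * (p' * R);               % remove it, restricting the next search
    end                                     % to perpendicular directions
    % P * P' is approximately the identity, and the rows of P come out ordered
    % by the variance they capture.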
In principle this simple algorithm works; however, that would belie the true reason why the orthonormality assumption is particularly judicious. Namely, the true benefit of this assumption is that it makes the solution amenable to linear algebra. There exist decompositions that can provide efficient, explicit algebraic solutions.

Notice what we also gained with the second assumption. We have a method for judging the "importance" of each principal direction. Namely, the variances associated with each direction p_i quantify how "principal" each direction is. We could thus rank-order each basis vector p_i according to the corresponding variances. We will now pause to review the implications of all the assumptions made to arrive at this mathematical goal.

E. Summary of Assumptions and Limits

This section provides an important context for understanding when PCA might perform poorly, as well as a road map for understanding new extensions to PCA. The Discussion returns to this issue and provides a lengthier treatment.

I. Linearity. Linearity frames the problem as a change of basis. Several areas of research have explored how applying a nonlinearity prior to performing PCA could extend this algorithm - this has been termed kernel PCA.

II. Mean and variance are sufficient statistics. The formalism of sufficient statistics captures the notion that the mean and the variance entirely describe a probability distribution. The only class of probability distributions that are fully described by the first two moments are exponential distributions (e.g. Gaussian, Exponential, etc.). In order for this assumption to hold, the probability distribution of x_i must be exponentially distributed. Deviations from this could invalidate the assumption.⁶ On the flip side, this assumption formally guarantees that the SNR and the covariance matrix fully characterize the noise and redundancies.

III. Large variances have important dynamics. This assumption also encompasses the belief that the data has a high SNR. Hence, principal components with larger associated variances represent interesting dynamics, while those with lower variances represent noise.

IV. The principal components are orthogonal. This assumption provides an intuitive simplification that makes PCA soluble with linear algebra decomposition techniques. These techniques are highlighted in the two following sections.

We have discussed all aspects of deriving PCA - what remain are the linear algebra solutions. The first solution is somewhat straightforward, while the second solution involves understanding an important algebraic decomposition.

⁶ A sidebar question: what if x_i are not Gaussian distributed? Diagonalizing a covariance matrix might not produce satisfactory results. The most rigorous form of removing redundancy is statistical independence: P(y_1, y_2) = P(y_1) P(y_2), where P(·) denotes the probability density. The class of algorithms that attempt to find the y_1, …, y_m satisfying this statistical constraint is termed independent component analysis (ICA). In practice though, quite a lot of real-world data are Gaussian distributed - thanks to the Central Limit Theorem - and PCA is usually a robust solution to slight deviations from this assumption.

V. SOLVING PCA: EIGENVECTORS OF COVARIANCE

We derive our first algebraic solution to PCA using linear algebra. This solution is based on an important property of eigenvector decomposition. Once again, the data set is X, an m × n matrix, where m is the number of measurement types and n is the number of samples. The goal is summarized as follows.

    Find some orthonormal matrix P where Y = PX such that C_Y ≡ (1/(n−1)) Y Yᵀ is diagonalized. The rows of P are the principal components of X.

We begin by rewriting C_Y in terms of our variable of choice P.

    C_Y = (1/(n−1)) Y Yᵀ
        = (1/(n−1)) (PX)(PX)ᵀ
        = (1/(n−1)) P X Xᵀ Pᵀ
        = (1/(n−1)) P (X Xᵀ) Pᵀ
    C_Y = (1/(n−1)) P A Pᵀ

Note that we have defined a new matrix A ≡ X Xᵀ, where A is symmetric (by Theorem 2 of Appendix A).

Our roadmap is to recognize that a symmetric matrix (A) is diagonalized by an orthogonal matrix of its eigenvectors (by Theorems 3 and 4 from Appendix A). For a symmetric matrix A, Theorem 4 provides

    A = E D Eᵀ    (6)

where D is a diagonal matrix and E is a matrix of the eigenvectors of A arranged as columns.

The matrix A has r ≤ m orthonormal eigenvectors, where r is the rank of the matrix. The rank of A is less than m when A is degenerate or all data occupy a subspace of dimension r ≤ m. Maintaining the constraint of orthogonality, we can remedy this situation by selecting (m − r) additional orthonormal vectors to "fill up" the matrix E. These additional vectors do not affect the final solution because the variances associated with these directions are zero.

Now comes the trick. We select the matrix P to be a matrix where each row p_i is an eigenvector of X Xᵀ. By this selection, P ≡ Eᵀ. Substituting into Equation 6, we find A = Pᵀ D P. With this relation and Theorem 1 of Appendix A (P⁻¹ = Pᵀ) we can finish evaluating C_Y.

    C_Y = (1/(n−1)) P A Pᵀ
        = (1/(n−1)) P (Pᵀ D P) Pᵀ
        = (1/(n−1)) (P Pᵀ) D (P Pᵀ)
        = (1/(n−1)) (P P⁻¹) D (P P⁻¹)
    C_Y = (1/(n−1)) D

It is evident that the choice of P diagonalizes C_Y. This was the goal for PCA. We can summarize the results of PCA in the matrices P and C_Y.

• The principal components of X are the eigenvectors of X Xᵀ, or the rows of P.
• The ith diagonal value of C_Y is the variance of X along p_i.

In practice, computing PCA of a data set X entails (1) subtracting off the mean of each measurement type and (2) computing the eigenvectors of X Xᵀ. This solution is encapsulated in demonstration Matlab code included in Appendix B.
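The two steps above fit in a short Matlab function. The sketch below is written in the spirit of the demonstration code referenced in Appendix B (which is not reproduced here), so the function name pca_eig and its exact interface are assumptions of this tutorial rather than the paper's own listing.

    function [signals, PC, V] = pca_eig(data)
    % PCA_EIG  Principal components via the covariance eigenvector route (a sketch).
    %   data    - m x n matrix: m measurement types, n trials
    %   signals - m x n matrix of projected data, Y = P*X
    %   PC      - columns are the principal components p_i
    %   V       - variances associated with each principal component

    [m, n] = size(data);
    data = data - repmat(mean(data, 2), 1, n);   % subtract off the mean of each type

    covariance = (data * data') / (n - 1);       % Equation 5
    [PC, D] = eig(covariance);                   % eigenvectors (columns) and eigenvalues
    V = diag(D);

    [V, order] = sort(V, 'descend');             % rank-order by variance
    PC = PC(:, order);

    signals = PC' * data;                        % project: Y = P*X with P = PC'
    end

Calling this sketch on the 6 × 72000 toy matrix would return six principal components; by the arguments above, most of the variance in V should concentrate in the first one.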
VI. A MORE GENERAL SOLUTION: SVD

This section is the most mathematically involved and can be skipped without much loss of continuity. It is presented solely for completeness. We derive another algebraic solution for PCA and in the process find that PCA is closely related to singular value decomposition (SVD). In fact, the two are so intimately related that the names are often used interchangeably. What we will see, though, is that SVD is a more general method of understanding change of basis.

We begin by quickly deriving the decomposition. In the following section we interpret the decomposition, and in the last section we relate these results to PCA.

A. Singular Value Decomposition

Let X be an arbitrary n × m matrix⁷ and XᵀX be a rank r, square, symmetric m × m matrix. In a seemingly unmotivated fashion, let us define all of the quantities of interest.

• {v̂_1, v̂_2, …, v̂_r} is the set of orthonormal m × 1 eigenvectors with associated eigenvalues {λ_1, λ_2, …, λ_r} of the symmetric matrix XᵀX:

    (XᵀX) v̂_i = λ_i v̂_i

• σ_i ≡ √λ_i are positive real and termed the singular values.
• {û_1, û_2, …, û_r} is the set of orthonormal n × 1 vectors defined by û_i ≡ (1/σ_i) X v̂_i.

We obtain the final definition by referring to Theorem 5 of Appendix A. The final definition includes two new and unexpected properties:

• û_i · û_j = δ_ij
• ‖X v̂_i‖ = σ_i

These properties are both proven in Theorem 5. We now have all of the pieces to construct the decomposition. The "value" version of singular value decomposition is just a restatement of the third definition,

    X v̂_i = σ_i û_i    (7)

This result says quite a bit. X multiplied by an eigenvector of XᵀX is equal to a scalar times another vector. The set of eigenvectors {v̂_1, v̂_2, …, v̂_r} and the set of vectors {û_1, û_2, …, û_r} are both orthonormal sets, or bases, in r-dimensional space.

We can summarize this result for all vectors in one matrix multiplication by following the prescribed construction in Figure 4. We start by constructing a new matrix Σ whose leading diagonal entries are the rank-ordered singular values and whose remaining entries are zero,

    Σ ≡ [ diag(σ̃_1, …, σ̃_r)   0 ]
        [          0           0 ]

where σ̃_1 ≥ σ̃_2 ≥ … ≥ σ̃_r are the rank-ordered set of singular values. Likewise we construct accompanying orthogonal matrices V and U,

    V = [ v̂_1  v̂_2  …  v̂_m ]
    U = [ û_1  û_2  …  û_n ]

where we have appended an additional (m − r) and (n − r) orthonormal vectors to "fill up" the matrices V and U respectively.⁸ Figure 4 provides a graphical representation of how all of the pieces fit together to form the matrix version of SVD,

    X V = U Σ    (8)

where each column of V and U performs the "value" version of the decomposition (Equation 7). Because V is orthogonal, we can multiply both sides by V⁻¹ = Vᵀ to arrive at the final form of the decomposition,

    X = U Σ Vᵀ    (9)

Although it was derived without motivation, this decomposition is quite powerful. Equation 9 states that any arbitrary matrix X can be converted to an orthogonal matrix, a diagonal matrix and another orthogonal matrix (or a rotation, a stretch and a second rotation). Making sense of Equation 9 is the subject of the next section.

⁷ Notice that in this section only we are reversing convention from m × n to n × m. The reason for this reversal will become clear in section 6.3.
⁸ This is the same procedure used to fix the degeneracy in the previous section.

FIG. 4  How to construct the matrix form of SVD from the "value" form. The "value" form of SVD is expressed in Equation 7, X v̂_i = σ_i û_i. The mathematical intuition behind the construction of the matrix form is that we want to express all n "value" equations in just one equation, and it is easiest to understand this process graphically by drawing the matrices of Equation 7. We construct three new matrices V, U and Σ. All singular values are first rank-ordered σ̃_1 ≥ σ̃_2 ≥ … ≥ σ̃_r, and the corresponding vectors are indexed in the same rank order. Each pair of associated vectors v̂_i and û_i is stacked in the ith column of its respective matrix, and the corresponding singular value σ_i is placed along the diagonal (the iith position) of Σ. This generates the equation X V = U Σ. The matrices V and U are m × m and n × n matrices respectively, and Σ is a diagonal matrix with a few non-zero values (represented in the figure by the checkerboard) along its diagonal. Solving this single matrix equation solves all n "value" form equations.
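The definitions above can be checked numerically with Matlab's built-in svd. The sketch below is an assumed illustration on a random matrix; it verifies both the matrix form of Equation 9 and the "value" form of Equation 7.

    X = randn(5, 3);                 % an arbitrary n x m matrix (n = 5, m = 3)
    [U, S, V] = svd(X);              % Matlab's convention: X = U*S*V'

    norm(X - U * S * V')                     % ~1e-15: the matrix form, Equation 9
    i = 1;                                   % check the "value" form for one pair
    norm(X * V(:, i) - S(i, i) * U(:, i))    % ~1e-15: Equation 7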
B. Interpreting SVD

The final form of SVD (Equation 9) is a concise but thick statement to understand. Let us instead reinterpret Equation 7 as

    X a = k b

where a and b are column vectors and k is a scalar constant. The set {v̂_1, v̂_2, …, v̂_m} is analogous to a and the set {û_1, û_2, …, û_n} is analogous to b. What is unique, though, is that {v̂_1, v̂_2, …, v̂_m} and {û_1, û_2, …, û_n} are orthonormal sets of vectors which span an m- or n-dimensional space, respectively. In particular, loosely speaking, these sets appear to span all possible "inputs" (a) and "outputs" (b). Can we formalize the view that {v̂_1, v̂_2, …, v̂_m} and {û_1, û_2, …, û_n} span all possible "inputs" and "outputs"?

We can manipulate Equation 9 to make this fuzzy hypothesis more precise.

    X = U Σ Vᵀ
    UᵀX = Σ Vᵀ
    UᵀX = Z

where we have defined Z ≡ Σ Vᵀ. Note that the previous columns {û_1, û_2, …, û_n} are now rows in Uᵀ. Comparing this equation to Equation 1, {û_1, û_2, …, û_n} perform the same role as {p_1, p_2, …, p_m}. Hence, Uᵀ is a change of basis from X to Z. Just as before, when we transformed column vectors with P, we can again infer that we are transforming column vectors. The fact that the orthonormal basis Uᵀ (or P) transforms column vectors means that Uᵀ is a basis that spans the columns of X. Bases that span the columns are termed the column space of X. The column space formalizes the notion of what the possible "outputs" of any matrix are.

There is a funny symmetry to SVD such that we can define a similar quantity - the row space.

    X V = U Σ
    (X V)ᵀ = (U Σ)ᵀ
    Vᵀ Xᵀ = Σᵀ Uᵀ
    Vᵀ Xᵀ = Z

where we have defined Z ≡ Σᵀ Uᵀ. Again, the rows of Vᵀ (or the columns of V) are an orthonormal basis for transforming Xᵀ into Z. Because of the transpose on X, it follows that V is an orthonormal basis spanning the row space of X. The row space likewise formalizes the notion of what the possible "inputs" into an arbitrary matrix are.

We are only scratching the surface of the full implications of SVD. For the purposes of this tutorial, though, we have enough information to understand how PCA falls within this framework.

C. SVD and PCA

With similar computations it is evident that the two methods are intimately related. Let us return to the original m × n data matrix X. We can define a new n × m matrix Y,⁹

    Y ≡ (1/√(n−1)) Xᵀ

where each column of Y has zero mean. The definition of Y becomes clear by analyzing YᵀY.

    YᵀY = ((1/√(n−1)) Xᵀ)ᵀ ((1/√(n−1)) Xᵀ)
        = (1/(n−1)) X Xᵀ
    YᵀY = C_X

By construction YᵀY equals the covariance matrix of X. From section 5 we know that the principal components of X are the eigenvectors of C_X. If we calculate the SVD of Y, the columns of the matrix V contain the eigenvectors of YᵀY = C_X. Therefore, the columns of V are the principal components of X. This second algorithm is encapsulated in Matlab code included in Appendix B.

What does this mean? V spans the row space of Y ≡ (1/√(n−1)) Xᵀ. Therefore, V must also span the column space of (1/√(n−1)) X. We can conclude that finding the principal components¹⁰ amounts to finding an orthonormal basis that spans the column space of X.

⁹ Y is of the appropriate n × m dimensions laid out in the derivation of section 6.1. This is the reason for the "flipping" of dimensions in 6.1 and Figure 4.
¹⁰ If the final goal is to find an orthonormal basis for the column space of X, then we can calculate it directly without constructing Y. By symmetry the columns of U produced by the SVD of (1/√(n−1)) X must also be the principal components.
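As with the first algorithm, this one is only a few lines of Matlab. The sketch below follows the spirit of the second Appendix B routine (which is not reproduced here), so the function name pca_svd and its interface are assumptions of this tutorial.

    function [signals, PC, V] = pca_svd(data)
    % PCA_SVD  Principal components via the SVD route (a sketch).
    %   data    - m x n matrix: m measurement types, n trials
    %   signals - m x n matrix of projected data
    %   PC      - columns are the principal components
    %   V       - variances associated with each principal component

    [m, n] = size(data);
    data = data - repmat(mean(data, 2), 1, n);   % subtract off the mean of each type

    Y = data' / sqrt(n - 1);                     % Y = (1/sqrt(n-1)) X', so Y'*Y = C_X
    [u, S, PC] = svd(Y);                         % columns of PC are the principal components

    S = diag(S);
    V = S .* S;                                  % variances are the squared singular values

    signals = PC' * data;                        % project the original data
    end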
FIG. 5  Data points (black dots) tracking a person on a ferris wheel. The extracted principal components are (p_1, p_2) and the phase is θ̂.

FIG. 6  Non-Gaussian distributed data causes PCA to fail. In exponentially distributed data the axes with the largest variance do not correspond to the underlying basis.

VII. DISCUSSION AND CONCLUSIONS

A. Quick Summary

Performing PCA is quite simple in practice.

1. Organize a data set as an m × n matrix, where m is the number of measurement types and n is the number of trials.
2. Subtract off the mean for each measurement type, or row, x_i.
3. Calculate the SVD or the eigenvectors of the covariance.

In several fields of literature, many authors refer to the individual measurement types x_i as the sources. The data projected into the principal components, Y = PX, are termed the signals, because the projected data presumably represent the underlying items of interest.

B. Dimensional Reduction

One benefit of PCA is that we can examine the variances in C_Y associated with the principal components. Often one finds large variances associated with the first k < m principal components, and then a precipitous drop-off. One can conclude that most interesting dynamics occur only in the first k dimensions.

In the example of the spring, k = 1. This process of throwing out the less important axes can help reveal hidden, simplified dynamics in high-dimensional data. This process is aptly named dimensional reduction.

C. Limits and Extensions of PCA

Both the strength and weakness of PCA is that it is a non-parametric analysis. One only needs to make the assumptions outlined in section 4.5 and then calculate the corresponding answer. There are no parameters to tweak and no coefficients to adjust based on user experience - the answer is unique¹¹ and independent of the user.

This same strength can also be viewed as a weakness. If one knows a priori some features of the structure of a system, then it makes sense to incorporate these assumptions into a parametric algorithm - or an algorithm with selected parameters.

Consider the recorded positions of a person on a ferris wheel over time in Figure 5. The probability distributions along the axes are approximately Gaussian and thus PCA finds (p_1, p_2); however, this answer might not be optimal. The most concise form of dimensional reduction is to recognize that the phase (or angle along the ferris wheel) contains all dynamic information. Thus, the appropriate parametric algorithm is to first convert the data to the appropriately centered polar coordinates and then compute PCA.

This prior non-linear transformation is sometimes termed a kernel transformation and the entire parametric algorithm is termed kernel PCA. Other common kernel transformations include Fourier and Gaussian transformations. This procedure is parametric because the user must incorporate prior knowledge of the structure in the selection of the kernel, but it is also more optimal in the sense that the structure is more concisely described.

Sometimes, though, the assumptions themselves are too stringent. One might envision situations where the principal components need not be orthogonal. Furthermore, the distributions along each dimension (x_i) need not be of

¹¹ To be absolutely precise, the principal components are not uniquely defined; only the sub-space is unique. One can always flip the direction of the principal components by multiplying by −1. In addition, eigenvectors beyond the rank of a matrix (i.e. σ_i = 0 for i > rank) can be selected almost at whim. However, these degrees of freedom do not affect the qualitative features of the solution nor a dimensional reduction.