Multivariate Statistics with R
189 Pages · 1.824 MB · English
Multivariate Statistics with R
Paul J. Hewson
March 17, 2009
© Paul Hewson

Contents

1 Multivariate data
  1.1 The nature of multivariate data
  1.2 The role of multivariate investigations
  1.3 Summarising multivariate data (presenting data as a matrix, mean vectors, covariance matrices)
    1.3.1 Data display
  1.4 Graphical and dynamic graphical methods
    1.4.1 Chernoff's Faces
    1.4.2 Scatterplots, pairwise scatterplots (draftsman plots)
    1.4.3 Optional: 3d scatterplots
    1.4.4 Other methods
  1.5 Animated exploration

2 Matrix manipulation
  2.1 Vectors
    2.1.1 Vector multiplication; the inner product
    2.1.2 Outer product
    2.1.3 Vector length
    2.1.4 Orthogonality
    2.1.5 Cauchy-Schwarz Inequality
    2.1.6 Angle between vectors
  2.2 Matrices
    2.2.1 Transposing matrices
    2.2.2 Some special matrices
    2.2.3 Equality and addition
    2.2.4 Multiplication
  2.3 Crossproduct matrix
    2.3.1 Powers of matrices
    2.3.2 Determinants
    2.3.3 Rank of a matrix
  2.4 Matrix inversion
  2.5 Eigenvalues and eigenvectors
  2.6 Singular Value Decomposition
  2.7 Extended Cauchy-Schwarz Inequality
  2.8 Partitioning
  2.9 Exercises

3 Measures of distance
  3.1 Mahalanobis Distance
    3.1.1 Distributional properties of the Mahalanobis distance
  3.2 Definitions
  3.3 Distance between points
    3.3.1 Quantitative variables: Interval scaled
    3.3.2 Distance between variables
    3.3.3 Quantitative variables: Ratio scaled
    3.3.4 Dichotomous data
    3.3.5 Qualitative variables
    3.3.6 Different variable types
  3.4 Properties of proximity matrices

4 Cluster analysis
  4.1 Introduction to agglomerative hierarchical cluster analysis
    4.1.1 Nearest neighbour / Single linkage
    4.1.2 Furthest neighbour / Complete linkage
    4.1.3 Group average link
    4.1.4 Alternative methods for hierarchical cluster analysis
    4.1.5 Problems with hierarchical cluster analysis
    4.1.6 Hierarchical clustering in R
  4.2 Cophenetic Correlation
  4.3 Divisive hierarchical clustering
  4.4 K-means clustering
    4.4.1 Partitioning around medoids
    4.4.2 Hybrid algorithms
  4.5 K-centroids
  4.6 Further information

5 Multidimensional scaling
  5.1 Metric Scaling
    5.1.1 Similarities with principal components analysis
  5.2 Visualising multivariate distance
  5.3 Assessing the quality of fit
    5.3.1 Sammon Mapping

6 Multivariate normality
  6.1 Expectations and moments of continuous random functions
  6.3 Multivariate normality
    6.5.1 R estimation
  6.6 Transformations

7 Inference for the mean
  7.1 Two sample Hotelling's T² test
  7.2 Constant Density Ellipses
  7.3 Multivariate Analysis of Variance

8 Discriminant analysis
  8.1 Fisher discrimination
  8.2 Accuracy of discrimination
  8.3 Importance of variables in discrimination
  8.4 Canonical discriminant functions
  8.5 Linear discrimination: a worked example

9 Principal component analysis
  9.1 Derivation of Principal Components
    9.1.1 A little geometry
    9.1.2 Principal Component Stability
  9.2 Some properties of principal components
  9.8 Illustration of Principal Components
    9.8.1 An illustration with the Sydney Heptathlon data
    9.8.2 Principal component scoring
    9.8.3 Prepackaged PCA function 1: princomp()
    9.8.4 Inbuilt functions 2: prcomp()
  9.9 Principal Components Regression
  9.10 "Model" criticism for principal components analysis
    9.10.1 Distribution theory for the eigenvalues and eigenvectors of a covariance matrix
  9.13 Sphericity
    9.15.1 Partial sphericity
  9.22 How many components to retain
    9.22.1 Data analytic diagnostics
    9.23.1 Cross validation
    9.23.2 Forward search
    9.23.3 Assessing multivariate normality
  9.25 Interpreting the principal components
  9.27 Exercises

10 Canonical Correlation
  10.1 Canonical variates
  10.2 Interpretation
  10.3 Computer example
    10.3.1 Interpreting the canonical variables
    10.3.2 Hypothesis testing

11 Factor analysis
  11.1 Role of factor analysis
  11.2 The factor analysis model
    11.2.1 Centred and standardised data
    11.2.2 Factor indeterminacy
    11.2.3 Strategy for factor analysis
  11.3 Principal component extraction
    11.3.1 Diagnostics for the factor model
    11.3.2 Principal Factor solution
  11.4 Maximum likelihood solutions
  11.5 Rotation
  11.6 Factor scoring

Bibliography

Books

Many of the statistical analyses encountered to date consist of a single response variable and one or more explanatory variables. In the latter case, multiple regression, we regressed a single response (dependent) variable on a number of explanatory (independent) variables. This is occasionally referred to as “multivariate regression”, which is all rather unfortunate. There isn't an entirely clear “canon” of what is a multivariate technique and what isn't (one could argue that discriminant analysis involves a single dependent variable). However, we are going to consider the simultaneous analysis of a number of related variables. We may approach this in one of two ways.
The first group of problems relates to classification, where attention is focused on individuals who are more alike. In unsupervised classification (cluster analysis) we are concerned with a range of algorithms that at least try to identify individuals who are more alike, if not to distinguish clear groups of individuals. There is also a wide range of scaling techniques which help us visualise these differences in lower dimensionality. In supervised classification (discriminant analysis) we already have information on group membership, and wish to develop rules from the data to classify future observations.

The other group of problems concerns inter-relationships between variables. Again, we may be interested in lower dimensions that help us visualise a given data set. Alternatively, we may be interested to see how one group of variables is correlated with another group of variables. Finally, we may be interested in models for the interrelationships between variables.

This book is still a work in progress. Currently it contains material used as notes to support a module at the University of Plymouth, where we work in conjunction with Johnson and Wichern (1998). It covers a reasonably established range of multivariate techniques. There isn't however a clear “canon” of multivariate techniques, and some of the following books may also be of interest.

Introductory level books:

• Afifi and Clark (1990)
• Chatfield and Collins (1980)
• Dillon and Goldstein (1984)
• Everitt and Dunn (1991)
• Flury and Riedwyl (1988)
• Johnson (1998)
• Kendall (1975)
• Hair et al. (1995)
• et al. (1998)
• Manly (1994)

Intermediate level books:

• Flury (1997) (my personal favourite)
• Gnanadesikan (1997)
• Harris (1985)
• Krzanowski (2000); Krzanowski and Marriott (1994b)
• Rencher (2002)
• Morrison (2005)
• Seber (1984)
• Timm (1975)

More advanced books:

• Anderson (1984)
• Bilodeau and Brenner (1999)
• Giri (2003)
• Mardia et al. (1979)
• Muirhead (1982)
• Press (1982)
• Srivastava and Carter (1983)
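The two groups of problems sketched above can be previewed with standard R functions. The fragment below is a minimal illustration only, using R's built-in iris data and the recommended MASS package (neither of which is a dataset or dependency of this book): k-means for unsupervised classification, lda() for supervised classification, and cancor() for correlation between two groups of variables.

```r
# Unsupervised classification (cluster analysis): ignore the species labels
# and ask k-means to find three groups in the four measurements.
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
table(km$cluster, iris$Species)   # how well do the found clusters match species?

# Supervised classification (discriminant analysis): use the known labels
# to build a classification rule, then apply it back to the data.
library(MASS)                     # lda() lives in the recommended MASS package
ld <- lda(Species ~ ., data = iris)
mean(predict(ld)$class == iris$Species)   # resubstitution accuracy

# Inter-relationships between variables: canonical correlation between the
# sepal measurements and the petal measurements.
cc <- cancor(iris[, 1:2], iris[, 3:4])
cc$cor                            # canonical correlations
```

Resubstitution accuracy flatters a classifier, since the same observations are used to build and test the rule; later chapters address honest assessment of discrimination accuracy.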
